LLM Production — Serving & Inference

Chapter 12: Large Language Models, Section 8

I avoided the details of LLM serving for an embarrassingly long time. I could fine-tune models, run inference notebooks, even build RAG pipelines—but every time someone asked me “how would you actually serve this at scale?” in an interview, I’d mumble something about “put vLLM in front of it” and change the subject. The gap between “I can generate tokens on my laptop” and “I can serve 500 concurrent users at 20 tokens per second each without bankrupting my company” is enormous. Eventually, the discomfort of not understanding what actually happens between a user hitting Enter and a token arriving on their screen grew too great to ignore. This is that dive.

LLM serving is the engineering discipline of running large language model inference in production—handling real users, real latency requirements, and very real GPU bills. It grew into its own specialty around 2023, when the gap between “model works in a notebook” and “model works for a million users” became the bottleneck for most AI teams. The core challenge: autoregressive generation is fundamentally different from serving a classification model, and every optimization in this space flows from that difference.

Before we start, a heads-up. We’re going to cover memory bandwidth arithmetic, virtual memory concepts, some light systems engineering, and a fair amount of back-of-envelope cost math. You don’t need to know any of it beforehand. We’ll add the concepts we need one at a time, with explanation.

This isn’t a short journey, but I hope you’ll be glad you came.

What We’ll Cover

  • The restaurant that serves one dish at a time
  • Why autoregressive generation changes everything
  • The KV cache: your most expensive memory hog
  • Continuous batching: the conveyor belt upgrade
  • Rest stop and an off-ramp
  • PagedAttention: virtual memory for attention
  • Speculative decoding: writing a rough draft first
  • Tensor parallelism for serving
  • The serving framework landscape
  • Streaming and API design
  • The money: token economics and cost optimization
  • Guardrails in production
  • Resources and credits

The Restaurant That Serves One Dish at a Time

Imagine you run a tiny restaurant. You have one chef (your GPU), one kitchen (your GPU memory), and customers who place orders (user prompts). For most of this journey, we’ll be running this restaurant and discovering, one problem at a time, why LLM serving is so hard.

In a normal restaurant—say, one serving image classification—a customer walks in, orders, and the chef prepares the entire dish in one go. Steak? Here’s your steak. Done. The customer leaves. Next customer.

Our restaurant is different. We serve a very peculiar kind of meal: a token tasting menu. Each course is a single token, and the chef cannot begin preparing course number 5 until the customer has received courses 1 through 4. The chef looks at everything served so far, decides what the next course should be, plates it, sends it out, and then starts the next one. A 500-token response means 500 sequential trips from kitchen to table.

This is autoregressive generation—the model produces one token at a time, each conditioned on all previously generated tokens. And it changes everything about how we run this restaurant.

Why Autoregressive Generation Changes Everything

When I first started thinking about serving LLMs, I assumed the bottleneck would be compute—the same way training is compute-bound. I was wrong, and understanding why I was wrong took longer than I’d like to admit.

Let’s make this concrete. Say we have a 7-billion-parameter model stored in 16-bit precision. That’s 14 GB of weights sitting in GPU memory. During the prefill phase—when the model first processes the entire input prompt—there’s a lot of parallel computation happening. Every token in the prompt gets processed simultaneously through the attention layers. The GPU’s thousands of cores are busy. This phase is compute-bound, meaning the math is the bottleneck.

Then the decode phase begins. The model generates one token at a time. Each forward pass for a single new token requires reading all 14 GB of model weights from GPU memory, but the actual computation for that one token is tiny. An NVIDIA A100 has about 312 TFLOPS of compute capacity but only 2 TB/s of memory bandwidth. For that single-token decode step, most of the time is spent waiting for weight data to travel from memory to the compute cores. The arithmetic units sit idle.

This is the memory-bandwidth-bound regime, and it’s the defining characteristic of LLM serving. Going back to our restaurant: the chef (compute) is fast, but the walk from the pantry (memory) to the stove (compute cores) takes forever relative to the actual cooking time. The chef spends most of each course standing in the hallway, not chopping vegetables.
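To put rough numbers on that hallway walk, here is a back-of-envelope sketch in Python. It uses only the figures already quoted above (14 GB of weights, ~312 TFLOPS, ~2 TB/s); the two-FLOPs-per-parameter estimate is a standard approximation for a single-token decode step, not an exact count.

# Back-of-envelope: why a single-token decode step is memory-bound on an A100.
weight_bytes  = 7e9 * 2     # 7B params in fp16 = 14 GB of weights
mem_bandwidth = 2e12        # ~2 TB/s
compute       = 312e12      # ~312 TFLOPS of fp16 tensor-core throughput

# One decode step reads every weight roughly once and does about
# 2 FLOPs per parameter (one multiply and one add).
t_memory  = weight_bytes / mem_bandwidth   # ~7.0 ms just moving weights
t_compute = (2 * 7e9) / compute            # ~0.045 ms of actual math

print(f"memory: {t_memory*1e3:.2f} ms, compute: {t_compute*1e3:.3f} ms")
# memory: 7.00 ms, compute: 0.045 ms -> the arithmetic units wait roughly 150x
# longer than they work, which is exactly the memory-bandwidth-bound regime.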

This distinction between prefill and decode creates two very different latency metrics that every LLM serving engineer cares about:

Time to First Token (TTFT) measures how long it takes, after the user sends a prompt, for the first response token to appear. This is dominated by the prefill phase—processing the entire input. For a chat application, you want this under 500 milliseconds or the experience feels sluggish.

Time Per Output Token (TPOT) measures the gap between consecutive output tokens during generation. This is the decode phase latency. For streaming responses that feel natural, you need roughly 50 milliseconds per token—about 20 tokens per second, which matches comfortable reading speed.

The total latency for a response is TTFT plus the number of output tokens multiplied by TPOT. A 500-token response with 400ms TTFT and 40ms TPOT takes about 20.4 seconds. Shaving 10ms off TPOT saves 5 seconds on that response. Shaving 10ms off TTFT saves 10 milliseconds. At scale, TPOT optimization dominates.
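As a quick sanity check on that arithmetic, here is the same calculation as a throwaway Python snippet; the numbers are the ones from the paragraph above.

def total_latency_s(ttft_ms, tpot_ms, output_tokens):
    # Total response latency = TTFT + output_tokens * TPOT, in seconds
    return (ttft_ms + output_tokens * tpot_ms) / 1000

print(total_latency_s(400, 40, 500))   # 20.4  (the baseline response)
print(total_latency_s(400, 30, 500))   # 15.4  (10 ms off TPOT saves 5 s)
print(total_latency_s(390, 40, 500))   # 20.39 (10 ms off TTFT saves 10 ms)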

The KV Cache: Your Most Expensive Memory Hog

Here’s where our restaurant analogy gets uncomfortable. Remember how the chef needs to look at every previous course to decide the next one? In transformer terms, each new token needs to attend to every prior token. That means computing attention scores between the new token’s query vector and the key vectors of all previous tokens, then using those scores to weight the value vectors.

Without any optimization, generating token number 247 would require re-running the entire model on all 247 tokens from scratch. That’s spectacularly wasteful. The key and value vectors for tokens 1 through 246 haven’t changed—why recompute them?

The solution is the KV cache (key-value cache). As each token is generated, we store its key and value vectors for every attention layer. When generating the next token, we only compute the query, key, and value for the new token, then look up all the cached keys and values from previous tokens. This turns what would be quadratic recomputation into a linear memory read.

Let’s calculate how much memory this actually costs. For each token, at each layer, we store one key vector and one value vector. Each vector has a size equal to the number of attention heads times the head dimension. The formula:

KV cache per token = num_layers × num_heads × head_dim × 2 × bytes_per_element
                                                                  ↑
                                                        (2 for key + value)

Let’s run through our 7B model—say, a Llama-2 7B. It has 32 layers, 32 attention heads, each with a head dimension of 128, and we’re using 16-bit (2-byte) precision:

Per token: 32 layers × 32 heads × 128 dim × 2 (K+V) × 2 bytes
         = 32 × 32 × 128 × 4
         = 524,288 bytes
         ≈ 0.5 MB per token

Half a megabyte per token. That sounds small, until you consider a conversation with a 4,096-token context:

4,096 tokens × 0.5 MB = 2,048 MB ≈ 2 GB per request

Two gigabytes of KV cache for a single user conversation on a 7B model. Now imagine a 70B model like Llama-2 70B, which has 80 layers and 64 attention heads. With standard multi-head attention, the KV cache per token jumps to roughly 2.6 MB, and a 4K context eats over 10 GB. (The released Llama-2 70B actually uses grouped-query attention, caching keys and values for only 8 head groups per layer, which cuts this by 8x; the scaling pressure is the same, just postponed.) The weights alone take 140 GB in fp16, more than a single 80 GB A100 can hold, so you need at least two GPUs just for the weights, and the full multi-head KV cache for a batch of 8 concurrent users at 4K context each would consume another 80+ GB on top of that.
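If you want to check these numbers yourself, the per-token formula translates directly into a few lines of Python. This is just the arithmetic from above; the 70B line assumes standard multi-head attention (pass 8 for num_kv_heads to see the grouped-query version).

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_element=2):
    # Per token, per layer: one key and one value vector per KV head.
    per_token = num_layers * num_kv_heads * head_dim * 2 * bytes_per_element
    return per_token * seq_len

GB = 1024 ** 3
# Llama-2 7B: 32 layers, 32 heads, head_dim 128, fp16, 4K context
print(kv_cache_bytes(32, 32, 128, 4096) / GB)    # 2.0 GB
# 70B with full multi-head attention: 80 layers, 64 heads
print(kv_cache_bytes(80, 64, 128, 4096) / GB)    # 10.0 GB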

I’ll be honest—the first time I ran these numbers, I didn’t believe them. I recalculated three times. The KV cache for long-context serving can consume more memory than the model weights themselves. This is the central resource management problem of LLM serving, and it explains why so much innovation has gone into making KV cache management smarter.

Back at our restaurant: the KV cache is like the chef’s memory of what every customer has ordered so far. The longer the tasting menu goes on, the more notes the chef has to keep. With enough customers at long enough meals, the notebook becomes bigger than the recipe book.

Continuous Batching: The Conveyor Belt Upgrade

So we have our restaurant with one chef, and customers are trickling in. A natural idea: batch the orders. Instead of making one dish at a time, the chef prepares multiple courses for multiple customers simultaneously. In GPU terms, we process tokens for multiple requests in the same forward pass, amortizing the cost of reading model weights from memory across more useful work.

The naive approach is static batching. Four customers walk in. The chef takes all their orders, starts cooking, and doesn’t accept new customers until all four are done. The problem? Customer A ordered a 50-token appetizer. Customer D ordered a 2,000-token banquet. Customers A, B, and C finish in seconds and sit around waiting for D. Their table is occupied, their KV cache memory is wasted, and no new customers can sit down.

This is exactly what happens with naive LLM batching. Requests get padded to the length of the longest sequence in the batch, GPU cycles are wasted on padding tokens, and shorter requests can’t release their resources until the longest one finishes.

The fix, introduced in the Orca paper (Yu et al., 2022), is continuous batching—also called iteration-level scheduling or in-flight batching. Instead of processing a batch as a monolithic unit, the scheduler makes decisions at every single decode step. When Customer A finishes after 50 tokens, her seat is immediately freed and Customer E (who’s been waiting in line) slides right in. No waiting. No padding. No wasted memory.

Let me walk through a toy example. We have GPU capacity for a batch of 3, and 5 requests arrive:

Time step 1: [Req A (token 1), Req B (token 1), Req C (token 1)]
Time step 2: [Req A (token 2), Req B (token 2), Req C (token 2)]
Time step 3: [Req A (done!),   Req B (token 3), Req C (token 3)]
             → Req A exits, Req D enters
Time step 4: [Req D (prefill), Req B (token 4), Req C (done!)]
             → Req C exits, Req E enters
Time step 5: [Req D (token 1), Req B (token 5), Req E (prefill)]

The batch slots stay fully occupied. No one waits unnecessarily. In benchmarks, continuous batching delivers 10–23x higher throughput compared to static batching, depending on the variance in request lengths. The more diverse the request lengths, the bigger the win.

Our restaurant has upgraded from a sit-down dinner format to a conveyor belt sushi bar. Seats are always full. When a customer leaves, the next one sits down instantly. The chef never idles.

There’s a subtlety I glossed over: notice that in time step 4, Request D is doing its prefill (processing its full prompt) while Requests B and C are doing decode (generating one token each). These two operations have very different computational profiles—prefill is compute-heavy, decode is memory-heavy. Modern serving engines carefully schedule these together to maximize utilization of both compute and memory bandwidth simultaneously. Getting this balance right is an art, and I’m still developing my intuition for when the scheduler makes which tradeoff.
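To make the iteration-level scheduling loop concrete, here is a heavily simplified sketch. The Request class, slot count, and admission policy are all made up for illustration; a real engine like vLLM also has to run prefill for newly admitted requests and manage KV cache blocks.

from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    tokens_needed: int        # how many output tokens this request wants
    tokens_done: int = 0

def continuous_batching(waiting, max_batch=3):
    # One decode iteration per loop pass. Finished requests free their slot
    # immediately and a waiting request is admitted on the very next step.
    running, steps = [], 0
    while running or waiting:
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())     # admit into any free slot
        steps += 1
        for req in running:
            req.tokens_done += 1                  # one token per active request
        running = [r for r in running if r.tokens_done < r.tokens_needed]
    return steps

reqs = deque(Request(n, t) for n, t in
             [("A", 3), ("B", 6), ("C", 4), ("D", 5), ("E", 4)])
print(continuous_batching(reqs))   # 8 decode steps; no request waits for a whole batch to finish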

Rest Stop and an Off-Ramp

Congratulations on making it this far. You can stop here if you want.

You now have a solid mental model for how LLM serving works: autoregressive generation is memory-bandwidth-bound during decode, the KV cache stores key-value vectors for all prior tokens and grows with sequence length, and continuous batching keeps your GPU utilization high by scheduling at the iteration level. That’s enough to have a meaningful conversation about LLM infrastructure, understand why serving costs what it does, and evaluate serving solutions at a high level.

It doesn’t tell the complete story, though. We haven’t talked about how to manage KV cache memory efficiently (PagedAttention), how to speed up the inherently sequential decode step (speculative decoding), how to split a model across multiple GPUs for serving (tensor parallelism), or how to reason about the economics of it all. These are the things that separate someone who uses a serving framework from someone who chooses and configures one wisely.

The short version: PagedAttention eliminates KV cache memory waste, speculative decoding gets you 2–3x speedup on decode, and your cloud bill depends more on how well you batch than how fast your single-request latency is. There. You’re 60% of the way.

But if the discomfort of not knowing what’s underneath is nagging at you, read on.

PagedAttention: Virtual Memory for Attention

The KV cache memory problem has a familiar shape. We have a scarce resource (GPU memory) shared among many consumers (concurrent requests) with unpredictable lifetimes (we don’t know how long each response will be). Some requests finish early and free up space. New requests arrive and need allocation. If we pre-allocate the maximum possible context length for every request, we waste enormous amounts of memory on padding. If we try to allocate exactly what’s needed, we get fragmentation as requests come and go.

This is exactly the problem operating systems solved decades ago with virtual memory and paging. And that’s precisely the insight behind PagedAttention, introduced in the vLLM paper (Kwon et al., 2023).

In an OS, physical RAM is divided into fixed-size blocks called pages (typically 4 KB). A process doesn’t get a single contiguous slab of memory. Instead, it gets a page table that maps its logical addresses to physical pages scattered anywhere in RAM. Pages can be allocated on demand, freed individually, and reused immediately.

PagedAttention does the same thing for KV cache. Instead of allocating one contiguous buffer per request, GPU memory is divided into fixed-size KV blocks (each storing key-value data for a fixed number of tokens, say 16). Each request gets a block table—a mapping from its logical token positions to physical KV blocks. The blocks don’t need to be contiguous in memory.

Let me walk through what this looks like. Say our GPU has space for 8 KV blocks, and we have three active requests:

Physical KV blocks in GPU memory:
[Block 0] [Block 1] [Block 2] [Block 3] [Block 4] [Block 5] [Block 6] [Block 7]

Request A (30 tokens, needs 2 blocks): block table → [Block 0, Block 4]
Request B (45 tokens, needs 3 blocks): block table → [Block 1, Block 3, Block 6]
Request C (12 tokens, needs 1 block):  block table → [Block 2]

Free pool: [Block 5, Block 7]

When Request C finishes, Block 2 goes back to the free pool. When Request D arrives, it grabs blocks from the pool. No defragmentation needed. No memory wasted on pre-allocated maximum-length buffers. The waste is at most one partially-filled block per request—less than 4% in practice.

This is a massive improvement. Before PagedAttention, KV cache memory waste was estimated at 60–80% in typical serving scenarios due to over-allocation and fragmentation. PagedAttention brings it down to near zero, effectively doubling or tripling the number of concurrent requests a single GPU can handle.
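Here is a toy block manager to make the bookkeeping concrete. It is a sketch of the idea only: the block size, method names, and free-list policy are illustrative rather than vLLM's actual implementation, and it ignores prefix sharing and preemption.

class BlockManager:
    # Toy KV-block allocator: fixed-size blocks, one block table per request.
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free pool of physical block ids
        self.tables = {}                      # request id -> list of block ids
        self.counts = {}                      # request id -> tokens stored so far

    def append_token(self, req_id):
        # Reserve KV space for one new token; grab a block only when needed.
        table = self.tables.setdefault(req_id, [])
        n = self.counts.get(req_id, 0)
        if n % self.block_size == 0:          # last block is full (or first token)
            if not self.free:
                raise MemoryError("no free KV blocks: wait or preempt a request")
            table.append(self.free.pop())
        self.counts[req_id] = n + 1

    def release(self, req_id):
        # Request finished: its blocks go straight back to the free pool.
        self.free.extend(self.tables.pop(req_id, []))
        self.counts.pop(req_id, None)

mgr = BlockManager(num_blocks=8)
for _ in range(30): mgr.append_token("A")     # 30 tokens -> 2 blocks
for _ in range(45): mgr.append_token("B")     # 45 tokens -> 3 blocks
for _ in range(12): mgr.append_token("C")     # 12 tokens -> 1 block
print(mgr.tables, "free:", mgr.free)          # 6 blocks in use, 2 still free
mgr.release("C")                              # C's block is reusable immediately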

There’s a bonus feature that falls out naturally from this design: prefix caching. If multiple requests share the same system prompt (which is extremely common in production—every user of your chatbot gets the same 500-token system instruction), the KV cache blocks for that shared prefix can be computed once and reused across all requests. In page table terms, multiple block tables point to the same physical blocks. This saves both memory and the compute cost of re-processing the shared prefix.

Our restaurant analogy one more time: before PagedAttention, we were reserving an entire 20-seat table for every customer, even if they only ordered 3 courses. After PagedAttention, we use bar stools—each customer gets exactly as many stools as they need, scattered around the bar, and we keep a little map of where each customer is sitting. When they leave, the stools are available immediately.

Speculative Decoding: Writing a Rough Draft First

Everything we’ve discussed so far improves throughput—how many tokens per second you can generate across all users. But individual request latency is still fundamentally limited by the sequential nature of autoregressive decoding. You cannot generate token 10 until token 9 exists. Or can you?

Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) is a trick that lets you cheat the sequential bottleneck. The intuition: most tokens in a response are predictable. If the prompt says “The capital of France is”, the next token is almost certainly “Paris”. You don’t need a 70B model to guess that. A 1B model would get it right too.

The strategy has two players: a small, fast draft model and the large, accurate target model. Here’s how they work together:

The draft model quickly generates a sequence of K candidate tokens (say, K=5). It’s small, so this is fast. Then the target model processes all K candidate tokens in a single forward pass—this is the key insight: just as in prefill, the model can process multiple already-known tokens in parallel. The target model computes what it would have generated at each position and compares that to what the draft model proposed.

Let me trace through an example. The context so far is “The weather today is” and we’re generating the next 5 tokens:

Draft model generates: ["sunny", "and", "warm", "with", "rain"]

Target model verifies all 5 in one forward pass:
  Position 1: target agrees "sunny"  → ACCEPT
  Position 2: target agrees "and"    → ACCEPT
  Position 3: target agrees "warm"   → ACCEPT
  Position 4: target says "in"       → REJECT (draft said "with")
  Position 5: computed but discarded → everything after a rejection is thrown away

Result: accept tokens 1-3 ("sunny and warm"), discard the rest.
Target model samples its own token for position 4: "in"

Net: we got 4 tokens (3 accepted + 1 resampled) for the cost of
     one draft run + one target forward pass,
     instead of 4 separate target forward passes.

The mathematical guarantee that makes this trustworthy: the acceptance criterion is designed so that the final output distribution is identical to what you’d get from running the target model alone. No quality degradation. The speedup comes from the fact that the target model’s verification pass processes K tokens in parallel (like a prefill step), which is compute-bound and efficient, rather than doing K sequential decode steps, which are memory-bandwidth-bound and slow.

In practice, speculative decoding delivers 2–3x speedup on latency when the draft model’s acceptance rate is high (70–90% of tokens accepted). The acceptance rate depends on how well the draft model predicts the target model’s output. For formulaic text (code, structured data), acceptance rates are high. For creative, high-entropy text, they’re lower, and the speedup shrinks.

My favorite thing about speculative decoding is the part I glossed over above: the acceptance criterion uses a somewhat subtle rejection sampling scheme that preserves the exact output distribution. Getting the math to work out so there’s zero quality loss is elegant—and it’s one of those techniques where few people anticipated how well it would work in practice until they tried it.
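For the curious, here is a sketch of that acceptance rule (after Leviathan et al., 2023). It assumes you already have, for each drafted position, the draft model's and target model's full next-token distributions; everything else, including the variable names, is illustrative.

import numpy as np

def verify_draft(draft_tokens, p_draft, p_target, rng=None):
    # draft_tokens: K token ids proposed by the draft model
    # p_draft[i], p_target[i]: each model's next-token distribution at position i
    # Returns the tokens actually emitted; the acceptance rule is designed so
    # the output distribution matches sampling from the target model alone.
    rng = rng or np.random.default_rng()
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_target[i][tok], p_draft[i][tok]
        if rng.random() < min(1.0, p / q):      # accept with probability min(1, p/q)
            emitted.append(tok)
        else:
            # Reject: resample from the leftover probability mass, then stop.
            residual = np.maximum(p_target[i] - p_draft[i], 0.0)
            residual /= residual.sum()
            emitted.append(int(rng.choice(len(residual), p=residual)))
            break                               # everything after a reject is discarded
    return emitted   # a real system also emits a bonus token when all K are accepted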

Tensor Parallelism for Serving

When a model is too large to fit on a single GPU once you leave room for KV cache (on a 40 GB A100, roughly anything above 13B parameters at fp16), we need to split it across multiple GPUs. For serving (as opposed to training), tensor parallelism is the preferred approach because it splits individual matrix multiplications across GPUs, keeping latency low.

The core idea: a large matrix multiplication can be decomposed into smaller ones that run in parallel on different GPUs, with a small communication step to combine results.

Consider a weight matrix of shape [4096, 16384] in a feed-forward layer. With 4 GPUs, we can split this by columns: each GPU gets a [4096, 4096] slice. Every GPU receives the same input, multiplies by its slice, and the results get concatenated. This is column parallelism.

Alternatively, we split by rows: each GPU gets a [1024, 16384] slice, and the input is also split. Each GPU computes a partial result, and the results are summed (an all-reduce operation). This is row parallelism.

In practice, transformer serving uses both. The attention Q/K/V projections use column parallelism—each GPU handles a subset of attention heads, which is natural since heads are independent. The output projection and feed-forward layers use a combination where one matrix is split by columns and the following one by rows, so only one all-reduce is needed per pair.
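You can watch the column-then-row decomposition work with nothing but NumPy. The four "GPUs" here are just array slices and the all-reduce is a plain sum; the dimensions are scaled down from the 4096 x 16384 example so this runs instantly.

import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal((1, 1024))        # one token's hidden state
W1 = rng.standard_normal((1024, 4096))     # up-projection: split by columns
W2 = rng.standard_normal((4096, 1024))     # down-projection: split by rows
n_gpus = 4

# Column parallelism: each "GPU" holds a column slice of W1 and sees the same input.
h_parts = [x @ w for w in np.split(W1, n_gpus, axis=1)]

# Row parallelism: each "GPU" holds the matching row slice of W2 and multiplies
# only its own activation slice; summing the partial results is the single
# all-reduce needed for the whole column/row pair.
y = sum(h @ w for h, w in zip(h_parts, np.split(W2, n_gpus, axis=0)))

print(np.allclose(y, x @ W1 @ W2))         # True: identical to the unsplit computation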

The catch is communication overhead. The all-reduce requires every GPU to send data to every other GPU. With NVLink (the high-bandwidth interconnect within a single machine), this takes about 10 microseconds for typical tensor sizes. Across machines over InfiniBand, it’s 10–100x slower. This is why tensor parallelism works well within a single 8-GPU node but breaks down across nodes. For cross-node scaling, pipeline parallelism (splitting by layers rather than within layers) has lower communication overhead but higher latency, making it more common for offline batch workloads than latency-sensitive serving.

The Serving Framework Landscape

With all these techniques—KV cache, continuous batching, PagedAttention, speculative decoding, tensor parallelism—you might wonder whether you need to implement them yourself. You don’t. Several frameworks package all of this into deployable systems. But choosing between them is a real decision, and the tradeoffs are not obvious.

vLLM is the most widely adopted open-source serving engine as of 2024. It pioneered PagedAttention, supports continuous batching, offers an OpenAI-compatible API out of the box, and runs on both NVIDIA and AMD GPUs. It’s written in Python with custom CUDA kernels for the hot path. Setup takes about 10 minutes. If you have no other constraints, start here.

TensorRT-LLM is NVIDIA’s optimized inference engine. It compiles models into highly tuned CUDA kernels, supports aggressive quantization (FP8, INT4 natively), and delivers the highest raw throughput and lowest latency on NVIDIA hardware. The cost is complexity: setup involves model compilation steps that can take hours, and debugging is harder because you’re working with a C++ runtime. It only works on NVIDIA GPUs. For teams with NVIDIA hardware and dedicated ML infrastructure engineers, it’s the performance ceiling.

Text Generation Inference (TGI) from Hugging Face is built in Rust for reliability and ease of deployment. It has native integration with the Hugging Face model hub, which means deploying a new model is often a one-line command. It supports multi-GPU inference and works on AMD GPUs. Performance is good but not at the frontier. Where TGI shines is time-to-production: you can go from “I found a model on the hub” to “it’s running behind an API” in minutes.

SGLang takes a different angle. Its primary innovation is programmable generation—you can enforce output structure (JSON schemas, regex patterns) and build branching logic directly into the generation process. For agentic applications where the LLM needs to produce structured output that feeds into downstream code, SGLang handles constraints that other frameworks bolt on as afterthoughts.

Ollama is the outlier in this list. It’s not designed for production scale—it’s designed for running LLMs locally on your own machine. Under the hood, it uses llama.cpp, a C++ inference engine that runs efficiently on CPUs and consumer GPUs using GGUF-format quantized models. I include it because many engineers encounter Ollama first, and understanding that it occupies a completely different niche (local development, prototyping, privacy-sensitive workloads) prevents the mistake of trying to scale it to serve 500 users.

I still sometimes get tripped up on framework selection because the landscape shifts so fast. A framework that was experimental six months ago might be production-ready now, and benchmarks from three months ago might be irrelevant. The safest strategy: start with vLLM (broadest community, fastest iteration), and switch to TensorRT-LLM if you’ve measured that you need more performance and have the engineering team to handle the complexity.

Streaming and API Design

Token-by-token generation has a user experience implication that traditional ML serving never had to deal with: streaming. Instead of waiting 20 seconds for a complete 500-token response, users see tokens appearing in real time, creating the illusion (well, the reality) of the model “thinking aloud.”

The technical mechanism is Server-Sent Events (SSE). The client sends a request, the server holds the connection open, and pushes each token (or small chunk of tokens) as it’s generated. The client renders them incrementally. The OpenAI API format has become the de facto standard for this: each chunk is a JSON object with a delta field containing the new token.

data: {"choices": [{"delta": {"content": "The"}}]}
data: {"choices": [{"delta": {"content": " capital"}}]}
data: {"choices": [{"delta": {"content": " of"}}]}
data: {"choices": [{"delta": {"content": " France"}}]}
data: {"choices": [{"delta": {"content": " is"}}]}
data: {"choices": [{"delta": {"content": " Paris"}}]}
data: {"choices": [{"delta": {"content": "."}}]}
data: [DONE]

Every major serving framework (vLLM, TGI, SGLang) supports OpenAI-compatible streaming APIs, which means you can swap backends without changing client code. This is a quiet but enormously practical standardization that happened without any formal specification process—everyone converged on OpenAI’s format because their client libraries were already everywhere.
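Here is what consuming one of these streams looks like from the client side, as a minimal sketch using the requests library. The URL and model name are placeholders for whatever your backend exposes, and real client code would add retry logic and mid-stream error handling (more on that below).

import json
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",     # placeholder endpoint
    json={
        "model": "my-model",                          # placeholder model name
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "stream": True,
    },
    stream=True,      # keep the HTTP connection open as chunks arrive
    timeout=120,      # streaming responses are long-lived; don't let defaults kill them
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue                              # skip blank lines and keep-alive comments
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"].get("content", "")
    print(delta, end="", flush=True)          # render each token as it arrives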

There are two design decisions that trip people up in production. First, timeouts: a streaming response can last 30 seconds or more for long outputs, and many default proxy and load balancer configurations will kill connections that “seem idle” even though tokens are flowing. You need to either send periodic keep-alive comments or configure your infrastructure to expect long-lived connections. Second, error handling mid-stream: if the model hits an error at token 247 out of 500, the client has already rendered 246 tokens. You need a protocol for signaling errors within the stream, and client code that handles partial responses gracefully.

The Money: Token Economics and Cost Optimization

At some point, every LLM serving conversation becomes a conversation about money. And it should, because the economics of LLM inference are unlike anything in traditional ML. The cost isn’t “pennies per prediction”—it’s “dollars per conversation.”

Let’s run the numbers for two scenarios: using an API versus self-hosting.

API pricing is straightforward. As of 2024, a frontier model like GPT-4 Turbo charges roughly $10 per million input tokens and $30 per million output tokens. For a chatbot handling 10,000 conversations per day, each averaging 500 input tokens and 300 output tokens, the daily cost is:

Input:  10,000 × 500  = 5M tokens   × $10/M  = $50/day
Output: 10,000 × 300  = 3M tokens   × $30/M  = $90/day
Total: $140/day ≈ $4,200/month

Self-hosted math is murkier. Suppose you run Llama-2 70B on 2 A100 GPUs in the cloud at $3/hour each. With continuous batching and decent utilization, you might get 200,000 tokens/hour throughput.

Cost per hour: 2 GPUs × $3/hr = $6/hr
Tokens per hour: 200,000
Cost per 1K tokens: $6 / 200 = $0.03/1K tokens

For the same chatbot (8M tokens/day):
$0.03 × 8,000 = $240/day ≈ $7,200/month

Wait—self-hosting is more expensive? It can be, especially at moderate scale with a model as large as 70B. But the math shifts dramatically with three levers:

Model size. A well-tuned 7B or 13B model can handle many tasks that don’t need 70B-level capability. The token throughput is 5–10x higher per GPU, and you need fewer GPUs. Self-hosted 7B inference can cost $0.003–0.008 per 1K tokens.

Utilization. The $6/hour GPU cost is constant whether it’s processing tokens or sitting idle. At 30% utilization (common for bursty traffic), your effective cost per token triples. At 90% utilization (achievable with continuous batching and sufficient traffic), the economics flip hard in favor of self-hosting.

Model cascading. Route easy queries (“What’s my account balance?”) to a cheap 7B model, and only send complex queries (“Analyze this legal contract for risks”) to the expensive 70B model or API. A well-designed router can send 70–80% of traffic to the small model, slashing costs dramatically.

The tipping point: below roughly 50 million tokens per month, APIs are almost always cheaper and operationally simpler. Above 500 million tokens per month with predictable traffic, self-hosting starts to win—but you’re paying in engineering complexity, on-call rotations, and GPU procurement headaches. Between 50M and 500M is the gray zone where the answer depends on your team’s infrastructure maturity.
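If you want to play with the break-even point yourself, the whole comparison fits in a few lines. The functions below reuse the figures from this section (GPT-4 Turbo-class API pricing, two A100s at $3/hour, 200,000 tokens/hour at full utilization); everything else is an assumption to adjust for your own workload.

def api_cost_per_month(tokens_in, tokens_out, in_per_m=10.0, out_per_m=30.0):
    # Dollars per month at pay-per-token API pricing.
    return tokens_in / 1e6 * in_per_m + tokens_out / 1e6 * out_per_m

def self_hosted_cost_per_month(total_tokens, gpu_dollars_per_hr=6.0,
                               tokens_per_hr=200_000, utilization=1.0):
    # You pay for GPU hours whether or not they produce tokens, so the
    # effective throughput scales with utilization.
    return total_tokens / (tokens_per_hr * utilization) * gpu_dollars_per_hr

# The chatbot from above: 10,000 conversations/day, 500 in + 300 out tokens, 30 days.
tin, tout = 10_000 * 500 * 30, 10_000 * 300 * 30
print(api_cost_per_month(tin, tout))                              # 4200.0
print(self_hosted_cost_per_month(tin + tout))                     # 7200.0
print(self_hosted_cost_per_month(tin + tout, utilization=0.3))    # 24000.0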

Other cost optimization techniques that compound:

Prompt caching. If your system prompt is 500 tokens and identical across all requests, prefix caching (enabled by PagedAttention) avoids recomputing those 500 tokens for every request. At 10,000 requests/day, that’s 5M tokens of prefill computation saved daily.

Quantization. Running a model in INT8 or INT4 instead of FP16 roughly halves or quarters the memory footprint, respectively, letting you fit the model on fewer GPUs and increasing throughput. Quality degradation depends on the model and task, but for many production workloads, INT4 quantization loses less than 1% on benchmarks.

Batching awareness in application design. If your application can tolerate 500ms extra latency, batching more requests together dramatically improves throughput per dollar. Real-time chat needs low latency. Background summarization, document processing, and batch classification can tolerate higher latency for lower cost.

Guardrails in Production

Here’s where the concerns shift from “can we serve tokens fast” to “should we serve these tokens at all.” Production LLM systems need safety layers that don’t exist in traditional ML serving, and bolting them on as an afterthought is how incidents happen.

Input guardrails intercept the user’s prompt before it reaches the model. These include length limits (a 100K-token prompt will blow your KV cache budget and hold a GPU hostage for seconds), prompt injection detection (attempts to override system instructions), and content policy enforcement (blocking prompts requesting harmful content). The challenge is doing this fast enough that it doesn’t add noticeable latency—a lightweight classifier running on CPU can screen prompts in under 5 milliseconds.

Output guardrails monitor the model’s response as it streams. This is trickier because you need to make decisions token by token. Some teams run a secondary classifier on the accumulated output at intervals (every 50 tokens, say), terminating generation if policy violations are detected. Others use regex pattern matching for known bad outputs (PII patterns like Social Security numbers, for instance). The tension: output filtering adds latency to every token, and false positives that cut off legitimate responses frustrate users.
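As one concrete example of the regex-style output guardrail, here is a sketch that scans the accumulated response at a fixed token interval for a US Social Security number pattern and cuts the stream off if one appears. The pattern, interval, and termination message are all illustrative choices.

import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # one example PII pattern
CHECK_EVERY = 50                                      # tokens between scans

def guarded_stream(token_stream):
    # Pass tokens through, but stop generation if a policy violation shows up
    # in the accumulated output. Checking only every CHECK_EVERY tokens keeps
    # the added per-token latency small, at the cost of some exposure.
    buffer = []
    for i, token in enumerate(token_stream, start=1):
        buffer.append(token)
        yield token
        if i % CHECK_EVERY == 0 and SSN_PATTERN.search("".join(buffer)):
            yield "\n[response terminated by content policy]"
            return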

Rate limiting per user prevents abuse and protects your GPU budget. Unlike traditional API rate limiting where each request is cheap, a single LLM request can monopolize a GPU for seconds. Rate limits should account for both request count and token volume. A user sending 100 requests of 10 tokens each is a different load profile than a user sending 5 requests of 10,000 tokens each.
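Here is a toy version of a rate limiter that budgets both dimensions at once. The per-minute limits, the sliding window, and the in-memory dict are illustrative; a production system would use shared storage such as Redis and charge actual rather than estimated token counts.

import time

class TokenAwareRateLimiter:
    # Budgets both requests per minute and tokens per minute for each user.
    def __init__(self, max_requests_per_min=60, max_tokens_per_min=20_000):
        self.max_requests = max_requests_per_min
        self.max_tokens = max_tokens_per_min
        self.history = {}                    # user id -> list of (timestamp, tokens)

    def allow(self, user_id, estimated_tokens):
        now = time.monotonic()
        window = [(t, n) for t, n in self.history.get(user_id, [])
                  if now - t < 60]           # keep only the last minute
        if (len(window) >= self.max_requests or
                sum(n for _, n in window) + estimated_tokens > self.max_tokens):
            return False                     # reject, queue, or downgrade the request
        window.append((now, estimated_tokens))
        self.history[user_id] = window
        return True

limiter = TokenAwareRateLimiter()
print(limiter.allow("alice", estimated_tokens=500))      # True
print(limiter.allow("alice", estimated_tokens=25_000))   # False: token budget exceeded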

Monitoring for LLM serving goes beyond standard HTTP metrics. The essential signals: TTFT and TPOT distributions at p50/p95/p99 (not averages—tail latency is where users suffer), tokens per second throughput, KV cache utilization percentage (approaching 100% means requests are about to queue), and error rates broken down by type (timeout, OOM, content policy violation). I’ll be honest—getting monitoring right was harder than getting serving right. A system that looks healthy on average can be miserable for 5% of users, and you won’t know unless you’re watching the tail.

Bringing It All Together

If you’re still with me, thank you. I hope it was worth the journey through GPU memory hierarchies, restaurant analogies, and cost spreadsheets.

We started with the fundamental insight that autoregressive generation makes LLM serving memory-bandwidth-bound, not compute-bound. We discovered that the KV cache—storing key-value vectors for every prior token at every layer—dominates the memory budget. We upgraded from static to continuous batching, turning our sit-down restaurant into a conveyor belt that never has empty seats. We borrowed virtual memory from operating systems to build PagedAttention, eliminating KV cache waste. We deployed a small draft model alongside the big target model for speculative decoding, cheating the sequential bottleneck. We split models across GPUs with tensor parallelism, navigated the framework landscape, designed streaming APIs, and stared at the real cost arithmetic of running these systems at scale.

My hope is that the next time someone asks you how to serve an LLM in production, instead of mumbling “put vLLM in front of it,” you’ll have a pretty darn good mental model of what vLLM is actually doing under the hood—and whether it’s the right choice for your particular restaurant.

Resources and Credits