Efficient LLMs — Quantization & PEFT

Chapter 12: Large Language Models Section 7

I treated my first large language model like a museum exhibit. Look, don’t touch. When someone on my team suggested “quantizing it to 4-bit,” I physically winced. You want to round seven billion carefully trained floating-point numbers down to four bits each? That felt like compressing a symphony into a ringtone. So I avoided it. I kept renting larger and larger GPU instances, watching the bills climb, telling myself that precision was non-negotiable.

Then one morning I woke up to a $2,400 cloud bill for a single week of inference. And I also needed to fine-tune the model for three different clients, each wanting their own domain-specific behavior. The full fine-tuning run for a 7B model was going to need 84 GB of VRAM — more than my A100 could offer. That was the morning the discomfort of not understanding quantization and parameter-efficient fine-tuning grew too painful to keep ignoring. Here is that dive.

Quantization is the practice of representing model weights (and sometimes activations) in fewer bits — going from 16-bit or 32-bit floating point down to 8-bit integers, 4-bit integers, or even lower. The first practical 4-bit methods for LLMs appeared in 2022–2023 (GPTQ, AWQ), and they’ve become the standard way large models actually get deployed. Parameter-Efficient Fine-Tuning (PEFT) is a family of techniques that customize a model’s behavior by training only a small fraction of its parameters — often less than 1%. LoRA (2021) is the dominant method, and QLoRA (2023) married quantization with LoRA to let you fine-tune a 65B model on a single GPU.

Before we start, a heads-up. We’re going to be working through some arithmetic about memory, touching on matrix decompositions, and writing actual training code. You don’t need to know any of it beforehand. We’ll add what we need, one piece at a time.

This isn’t a short journey, but I hope you’ll be glad you came.

Contents

  • The suitcase problem
  • What quantization actually does to a number
  • Symmetric vs asymmetric quantization
  • Granularity: how many numbers share a ruler
  • Post-training quantization: GPTQ, AWQ, and GGUF
  • SmoothQuant: taming the outliers
  • Quantization-aware training
  • Rest stop
  • The fine-tuning memory wall
  • LoRA: the low-rank insight
  • LoRA in practice: initialization, scaling, and rank
  • QLoRA: quantized base, full-precision adapters
  • The rest of the PEFT family
  • Choosing your method
  • Wrap-up
  • Resources

The Suitcase Problem

Imagine you’re packing for a two-week trip, but the airline says your suitcase can only weigh 24 kilograms. Your entire wardrobe weighs 140 kg. You have two choices: buy a bigger suitcase (rent more GPUs), or learn to pack smarter (quantization). Most people start by trying to buy a bigger suitcase. Eventually, everyone learns to pack.

Let’s make this concrete. A 7-billion-parameter model stores each parameter as a 16-bit floating-point number. That’s 2 bytes per parameter. So the weight storage alone is:

7,000,000,000 parameters × 2 bytes = 14,000,000,000 bytes = 14 GB

An NVIDIA RTX 4090 has 24 GB of VRAM. An A100 has 80 GB. So a 7B model in FP16 fits on a 4090 with room to spare, and a 70B model (140 GB) doesn’t fit on any single GPU. That’s the suitcase problem for inference.

For fine-tuning, the suitcase gets even heavier. You need the weights, the gradients (same size as weights), and the optimizer states. Adam keeps two extra copies — a running mean and a running variance — both in FP32. The math shakes out like this:

# The memory budget for fine-tuning a 7B model

weights    = 7B × 2 bytes  =  14 GB   (FP16)
gradients  = 7B × 2 bytes  =  14 GB   (FP16)
optimizer  = 7B × 8 bytes  =  56 GB   (2 × FP32 Adam moments)
                              ─────────
total                       ≈  84 GB

# For a 70B model:  840 GB.  That’s more than ten A100-80GB GPUs,
# and we haven’t even counted activations yet.
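
If you want to sanity-check this arithmetic yourself, it fits in a few lines of Python (a back-of-the-envelope sketch, assuming FP16 weights and gradients plus two FP32 Adam moment buffers):

# Rough fine-tuning memory estimate, in GB
def finetune_memory_gb(n_params):
    weights   = n_params * 2 / 1e9    # FP16 weights
    gradients = n_params * 2 / 1e9    # FP16 gradients
    optimizer = n_params * 8 / 1e9    # 2 × FP32 Adam moments
    return weights + gradients + optimizer

print(finetune_memory_gb(7e9))     # ≈ 84 GB
print(finetune_memory_gb(70e9))    # ≈ 840 GB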

This is the problem that the rest of this section exists to solve. We’re going to pack the same model into a much smaller suitcase — and we’re going to learn to customize it without unpacking everything.

I’ll be using a running scenario throughout: you’ve trained (or downloaded) a 7B language model that works well for general conversation. Now you need to deploy it for a customer support product. You need it to run on a single GPU. And three different clients want their own customized version. This is the kind of real-world problem where quantization and PEFT go from “nice to know” to “I literally cannot ship without this.”

What Quantization Actually Does to a Number

Before we get to fancy methods with intimidating acronyms, we need to understand one very simple operation: rounding.

A 16-bit floating-point number (FP16) can represent about 65,000 distinct values. An 8-bit integer (INT8) can represent 256 values. A 4-bit integer can represent 16 values. Quantization is the act of mapping those 65,000 possible values down to 256, or 16, or even fewer. Every value in the original weight matrix gets snapped to the nearest value on a much coarser grid.

Let’s walk through a tiny example. Say we have four weights from one layer of our model:

Original (FP16):   [0.23,  -0.71,  0.02,  1.05]

We want to store these as 4-bit signed integers, which means we have values from −8 to +7 — a total of 16 possible slots. First, we find the largest absolute value: 1.05. We compute a scale factor that maps 1.05 to the maximum INT4 value (7):

scale = 1.05 / 7 = 0.15

Now we divide each weight by the scale and round to the nearest integer:

0.23  / 0.15 =  1.53  →  round to  2
-0.71 / 0.15 = -4.73  →  round to -5
0.02  / 0.15 =  0.13  →  round to  0
1.05  / 0.15 =  7.00  →  round to  7

Quantized (INT4):   [2,  -5,  0,  7]

Each of those numbers fits in 4 bits instead of 16. To use the weights during inference, we multiply them back by the scale factor:

Dequantized:   [2 × 0.15,  -5 × 0.15,  0 × 0.15,  7 × 0.15]
             = [0.30,      -0.75,       0.00,       1.05]

Error:         [+0.07,     -0.04,       -0.02,      0.00]

That’s quantization. We introduced some rounding error — 0.23 became 0.30, 0.02 became 0.00 — but the overall structure of the values is preserved. The weight that was biggest is still biggest. The one that was negative is still negative. And we’re using 4× less memory.
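Here is that same round trip as a few lines of numpy (my own sketch of the naive symmetric scheme, nothing library-specific):

# The four-weight walkthrough, as code
import numpy as np

w = np.array([0.23, -0.71, 0.02, 1.05])
scale = np.abs(w).max() / 7                  # map the largest magnitude to +7
q = np.round(w / scale).astype(np.int8)      # [ 2, -5,  0,  7]
w_hat = q * scale                            # [0.30, -0.75, 0.00, 1.05]
print(q, w_hat, w_hat - w)                   # quantized, dequantized, error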

The entire field of quantization research is, at its core, about making these rounding errors as small as possible. Everything from GPTQ to AWQ to NF4 is a different strategy for minimizing the damage of this rounding step.

Symmetric vs Asymmetric Quantization

The example above used symmetric quantization: the range is centered at zero, stretching from −max to +max. One scale factor, done. This works well for LLM weight matrices, which tend to be roughly centered around zero (they’re usually close to a Gaussian distribution).

But what about values that aren’t centered at zero? If you have activations coming out of a ReLU layer, they’re always non-negative. If we use symmetric quantization, half of our 16 INT4 slots represent negative numbers that will never appear. We’re wasting half our precision.

Asymmetric quantization fixes this by allowing the range to be off-center. Instead of mapping [−max, +max], it maps [min, max]. This requires a second number, the zero-point, to track where zero falls on the quantized grid.

# Symmetric: one number to store (scale)
quantized = round(value / scale)
dequantized = quantized × scale

# Asymmetric: two numbers to store (scale + zero_point)
quantized = round(value / scale) + zero_point
dequantized = (quantized - zero_point) × scale

In practice, LLM weight quantization almost always uses symmetric quantization. The weights are well-behaved enough that the centered range works well, and you save the overhead of storing a zero-point for every group. Activation quantization sometimes benefits from asymmetric, especially after ReLU-style nonlinearities.
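To see why the zero-point helps, here is a small numpy comparison on non-negative values (a sketch with made-up activation values, not any library’s implementation):

# Symmetric vs asymmetric 4-bit on post-ReLU values
import numpy as np

x = np.array([0.0, 0.4, 1.1, 2.9])            # never negative

# Symmetric: range [-2.9, 2.9], half the 16 slots go unused
scale_sym = np.abs(x).max() / 7
err_sym = np.abs(np.round(x / scale_sym) * scale_sym - x).mean()

# Asymmetric: range [0, 2.9], all 16 slots are used
scale_asym = (x.max() - x.min()) / 15
zero_point = round(-x.min() / scale_asym)     # where 0.0 lands on the grid (here: 0)
q = np.round(x / scale_asym) + zero_point
err_asym = np.abs((q - zero_point) * scale_asym - x).mean()

print(err_sym, err_asym)                      # asymmetric error is roughly half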

Granularity: How Many Numbers Share a Ruler

Here’s a subtlety I didn’t appreciate until I saw it break things: the scale factor from our example above was computed from the entire set of four weights. In a real model, a single weight matrix might have 4096 × 4096 = 16 million values. If we compute one scale factor for all 16 million values, a single outlier — one weight that’s 10× larger than the rest — forces us to stretch the quantization grid so wide that most values get crushed into just a few bins near zero.

Think of it like using one ruler to measure both a skyscraper and a matchbox. The tick marks on the ruler are so far apart that you can’t tell a 2-inch matchbox from a 3-inch one.

This coarsest granularity is called per-tensor quantization, and it’s the worst option: one scale factor for millions of values. One outlier ruins everything.

Per-channel quantization is better: each row of the weight matrix gets its own scale factor. Now each row has its own ruler, so an outlier in row 37 doesn’t affect row 4,000. This helps a lot.

Per-group quantization takes it further: we chop each row into groups of, say, 128 consecutive values, and each group gets its own scale factor. Now we have 128 values sharing a ruler, which is fine-grained enough to handle local variations in the weight distribution. The cost is storing one extra FP16 scale factor per 128 INT4 values: 16 extra bits per 128 weights, an overhead of about 0.125 bits per parameter. A theoretical 4.0 bits becomes roughly 4.1 effective bits.

Group-128 has become the industry standard for 4-bit LLM quantization. I still don’t have a clean theoretical explanation for why 128 is the magic number, but empirically it lands right in the sweet spot between quality and overhead. Every major method — GPTQ, AWQ, GGUF’s k-quant — defaults to it.

Our suitcase analogy extends: per-tensor is like one compression bag for all your clothes. Per-group is like rolling each outfit individually — more packaging overhead, but everything arrives in better shape.
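
If you’re curious how little code per-group quantization takes, here is a minimal numpy sketch (real kernels pack two 4-bit values per byte and fuse dequantization into the matmul; this just shows the bookkeeping):

# Per-group symmetric INT4 quantization, group size 128
import numpy as np

def quantize_per_group(w, group_size=128, n_bits=4):
    qmax = 2 ** (n_bits - 1) - 1                       # 7 for INT4
    groups = w.reshape(-1, group_size)                 # one row per group
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales                                   # both get stored

def dequantize(q, scales):
    return q.astype(np.float32) * scales

w = np.random.randn(4096 * 4096).astype(np.float32)
q, scales = quantize_per_group(w)
print(np.abs(dequantize(q, scales).ravel() - w).mean())   # small average error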

Post-Training Quantization: GPTQ, AWQ, and GGUF

Everything so far has been basic rounding. The question is: can we do something smarter than naive rounding to minimize the damage? That’s what the big post-training quantization (PTQ) methods do. You take a pre-trained model, run it through a quantization algorithm, and get a smaller model with (hopefully) minimal quality loss. No retraining required.

Let me walk through the three methods that matter most, in the order they appeared.

GPTQ: Error Compensation

GPTQ (Frantar et al., 2023) was the first method that could reliably quantize LLMs to 4 bits without catastrophic quality loss. The key insight: when you round one weight and introduce error, you can adjust the remaining unquantized weights to compensate.

Imagine you’re packing a suitcase and you fold a shirt a bit too tight — it creates a bulge. Instead of ignoring it, you rearrange the next few items to smooth things out. GPTQ does this for every weight, one at a time, using a Hessian computed from a small calibration dataset (second-order information about how sensitive each layer’s output is to its weights) to figure out which adjustments matter most.

In practice, you feed 128–256 representative examples through the model to compute the Hessian, then quantize layer by layer. The whole process takes a few hours for a 7B model on a single GPU.

# GPTQ with the AutoGPTQ library
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,             # 4-bit quantization
    group_size=128,     # per-group with 128 values per group
    damp_percent=0.01,  # dampening for Hessian inverse stability
)

# model_id: any HF causal LM, e.g. "meta-llama/Llama-2-7b-hf"
# calibration_data: ~128 tokenized examples (dicts with input_ids / attention_mask)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(calibration_data)
model.save_quantized("model-gptq-4bit")
# 14 GB model → ~3.9 GB on disk

GPTQ was a breakthrough, but it has limitations. The quantization process is slow. Quality depends heavily on having good calibration data. And the Hessian-based compensation, while smart, is expensive to compute.

AWQ: Protecting What Matters

AWQ (Lin et al., 2023) came next and took a different philosophical approach. Instead of compensating for errors after the fact, AWQ asks: which weights matter most, and can we protect them?

The answer comes from looking at activations. About 1% of weight channels consistently produce large-magnitude activations — they’re “salient” channels that carry disproportionate importance. If naive quantization crushes these salient weights, quality collapses. AWQ’s trick: scale up the salient weights before quantizing (giving them more of the INT4 range), and scale down the corresponding activations to compensate. The mathematical result is identical, but the salient weights get much finer quantization resolution.

# AWQ conceptual trick (not actual code, but the idea)
#
# Before:  y = Quantize(W) · x          ← salient weights get crushed
# After:   y = Quantize(W · s) · (x / s)   ← salient weights preserved
#
# s is chosen based on activation magnitudes
# The math is identical: (W · s) · (x / s) = W · x

AWQ tends to produce slightly better quality than GPTQ at the same bit width (roughly 0.1–0.2 perplexity points closer to the FP16 baseline), quantizes faster, and integrates beautifully with vLLM for production serving. For GPU deployment, it’s become the default choice in 2024.
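
In practice you rarely implement any of this by hand; the AutoAWQ library does the activation analysis and scaling for you. The sketch below is roughly what a quantization run looks like (the model path and output directory are placeholders):

# AWQ 4-bit quantization with the AutoAWQ library
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)   # uses a built-in calibration set
model.save_quantized("model-awq-4bit")
tokenizer.save_pretrained("model-awq-4bit")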

GGUF: The Laptop Format

GPTQ and AWQ are GPU-first formats. GGUF (GPT-Generated Unified Format) is something completely different: it’s the format used by llama.cpp, and it’s designed for CPU inference. This is what lets you run a 7B model on a MacBook with no GPU.

GGUF offers a whole spectrum of quantization levels, each a different tradeoff:

Quant Type    Bits/Weight    7B Size     Quality
──────────────────────────────────────────────────
Q2_K          ~2.6 bits      ~2.8 GB     Barely usable
Q4_K_M        ~4.8 bits      ~4.1 GB     Good  ← sweet spot
Q5_K_M        ~5.7 bits      ~4.8 GB     Very good
Q6_K          ~6.6 bits      ~5.5 GB     Excellent
Q8_0          8.0 bits       ~7.2 GB     Near-lossless

The “K” in those names stands for k-quant, a clever technique that allocates more bits to important layers (like attention projections) and fewer bits to less important ones (like certain feed-forward layers). The “M” is for medium aggressiveness. There’s also “S” for small (more aggressive, lower quality) and “L” for large (less aggressive, higher quality).

GGUF matters because it democratized local LLM access. Tools like Ollama, LM Studio, and Jan all use GGUF under the hood. When someone says “I run LLaMA on my laptop,” they’re almost certainly using a GGUF quantized model.

Coming back to our scenario: if a client wants to run the model on their own hardware without a GPU, GGUF Q4_K_M is the answer. If we’re serving from a GPU in the cloud, AWQ 4-bit with vLLM. Different suitcases for different trips.
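
For completeness, here is roughly what consuming a GGUF file looks like from Python, using the llama-cpp-python bindings (the file name and prompt are placeholders; Ollama and LM Studio wrap the same machinery):

# Running a GGUF-quantized model on CPU with llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b.Q4_K_M.gguf",   # local GGUF file (placeholder)
    n_ctx=4096,                            # context window
    n_threads=8,                           # CPU threads
)
out = llm("Summarize our refund policy in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])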

SmoothQuant: Taming the Outliers

All three methods above quantize weights only. Activations stay in full precision during inference. There’s a good reason for this: activations have nasty outliers. Some channels produce values 100× larger than others, and if you try to quantize them to INT8, you either clip the outliers (catastrophic quality loss) or stretch the range so wide that all the normal values collapse to a few bins near zero.

But if we could quantize both weights and activations to INT8, we’d unlock hardware-native INT8 matrix multiplication. On modern GPUs, INT8 matmul is roughly 2× faster than FP16. That’s worth pursuing.

SmoothQuant (Xiao et al., 2023) found an elegant solution: transfer the quantization difficulty from activations to weights. The idea is to divide each activation channel by a smoothing factor s, and multiply the corresponding row of the weight matrix by s. Mathematically, the result is identical: (X / s) · (W · s) = X · W. But now the activation outliers have been tamed, and the weights have absorbed the extra variance. Both are now “smooth” enough to quantize to INT8 cleanly.

# SmoothQuant transform
#
# Original:    Y = X · W         (X has outlier channels)
# Smoothed:    Y = (X / s) · (W · s)
#
# s = max(|X|)^α / max(|W|)^(1-α)
# α controls migration strength (α = 0.5 is typical)
#
# After smoothing, both X/s and W·s quantize cleanly to INT8

SmoothQuant is the go-to method for high-throughput production serving where you want W8A8 (weights and activations both in INT8). The 2× speedup from hardware INT8 is real and meaningful at scale.
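
The transform itself is just per-channel scaling. Here is a minimal numpy sketch (my own illustration of the formula above, not the reference implementation):

# The SmoothQuant transform: migrate difficulty from activations to weights
import numpy as np

def smooth(X, W, alpha=0.5):
    # X: [tokens, channels], W: [channels, out_features]
    s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)
    s = np.clip(s, 1e-5, None)            # guard against division by zero
    return X / s, W * s[:, None]          # (X / s) @ (W * s) == X @ W

X = np.random.randn(16, 8); X[:, 3] *= 100.0    # channel 3 is an outlier
W = np.random.randn(8, 4)
X_s, W_s = smooth(X, W)
print(np.allclose(X @ W, X_s @ W_s))            # True: the output is unchanged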

Quantization-Aware Training

Everything above happens after training. But what if the model could learn to be quantization-friendly during training itself?

Quantization-aware training (QAT) simulates quantization during the forward pass: weights are fake-quantized (rounded to low precision, then cast back to float for gradient computation). The model learns to produce weights whose values land cleanly on the quantization grid, reducing the rounding errors that plague post-training methods.
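
The core trick fits in a few lines. Here is a minimal PyTorch sketch of fake quantization with a straight-through estimator (an illustration, not any particular QAT framework):

# Fake quantization with a straight-through estimator
import torch

def fake_quantize(w, n_bits=4):
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()    # forward sees w_q; backward treats rounding as identity

w = torch.randn(64, 64, requires_grad=True)
loss = fake_quantize(w).sum()
loss.backward()                      # gradients still reach the full-precision w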

The downside: QAT requires actually training (or fine-tuning) the model, which is expensive. For most practitioners, PTQ methods like GPTQ and AWQ are good enough. QAT becomes worth it when you’re pushing to extreme compression (2-bit) or when every last fraction of a perplexity point matters.

Recent work like EfficientQAT (2024) has made this more accessible — training a 2-bit 70B model on a single A100 in about 41 hours, with less than 3% accuracy degradation. And on the frontier, researchers at Microsoft have demonstrated BitNet — models trained from scratch with ternary weights (−1, 0, +1), about 1.58 bits each — which would fundamentally change the hardware requirements for inference. I’m still developing my intuition for whether these extreme low-bit models will become practical for mainstream use, but the direction is fascinating.

Rest Stop

Congratulations on making it this far. Seriously — you can stop here if you want.

You now have a solid mental model of model compression. You understand that quantization is controlled rounding, that the granularity of the scale factors matters enormously, that GPTQ compensates for errors, AWQ protects salient weights, GGUF enables CPU inference, and SmoothQuant smooths activations for W8A8 deployment. That suitcase analogy we started with? You now know five different ways to pack it.

Here’s the short version for those who want to bail: for GPU serving, use AWQ 4-bit with vLLM. For laptop inference, use GGUF Q4_K_M with Ollama. You’re 80% of the way to being effective in production. There. Done.

What we haven’t covered yet is the other half of the problem: customizing the model. Our three clients each want different behavior, and full fine-tuning is impossibly expensive. The techniques that follow — LoRA, QLoRA, and the broader PEFT family — are how the industry actually solves this. They’re also some of the most elegant ideas in modern ML.

If the discomfort of not knowing how to efficiently customize a 7B model for three different clients is nagging at you, read on.

The Fine-Tuning Memory Wall

We’ve solved the deployment problem. Our 7B model, quantized to 4-bit, fits comfortably on a single GPU at about 4 GB. But now Client A wants the model tuned for medical Q&A, Client B for legal document summarization, and Client C for customer support in Spanish. We need to fine-tune three separate versions.

Standard fine-tuning updates every one of the 7 billion parameters. We already computed the memory: 84 GB for weights + gradients + optimizer states. That’s more than an A100-80GB can hold. And we’d have to store three full copies of the model — 14 GB each in FP16, 42 GB in total — one per client.

This is the motivation for PEFT. Not academic curiosity. Not a “nice to have” optimization. We literally cannot do the thing our clients need without it.

LoRA: The Low-Rank Insight

I’ll be honest — when I first read the LoRA paper (Hu et al., 2021), I didn’t believe the central claim. They argued that when you fine-tune a pre-trained model, the actual change to each weight matrix lives in a tiny subspace. Even though a weight matrix might be 4096 × 4096 (about 16 million numbers), the update during fine-tuning has very low intrinsic rank — maybe 8 or 16 meaningful dimensions out of 4096.

Let’s build this up from a toy example. Say we have a small 4 × 4 weight matrix W in our model. During fine-tuning, the update ΔW would normally be another 4 × 4 matrix — 16 new parameters to learn:

W (frozen base)              ΔW (full update)
┌                       ┐    ┌                            ┐
│  0.5   0.1  -0.3   0.2│    │ 0.02   0.01   -0.01   0.03 │
│ -0.1   0.8   0.4  -0.6│    │ 0.04   0.02   -0.02   0.06 │
│  0.3  -0.2   0.7   0.1│    │ 0.01   0.005  -0.005  0.015│
│  0.6   0.0  -0.5   0.9│    │ 0.03   0.015  -0.015  0.045│
└                       ┘    └                            ┘

Look closely at ΔW. Every row is a scaled version of the same pattern: [2, 1, -1, 3]. Row 1 is that pattern × 0.01, row 2 is that pattern × 0.02, row 3 is × 0.005, row 4 is × 0.015. This matrix has rank 1 — all the information can be captured by one column vector times one row vector:

ΔW = B · A

where B (4×1) = [0.01, 0.02, 0.005, 0.015]ᵀ
and   A (1×4) = [2, 1, -1, 3]

Parameters: 4 + 4 = 8  (instead of 16)
That’s a 2× reduction for our tiny example.
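
You can verify the toy decomposition in a couple of lines of numpy:

# The rank-1 product B·A reproduces ΔW with 8 numbers instead of 16
import numpy as np

B = np.array([[0.01], [0.02], [0.005], [0.015]])   # 4×1
A = np.array([[2.0, 1.0, -1.0, 3.0]])              # 1×4
delta_W = B @ A
print(delta_W)                 # matches ΔW above, row for row
print(B.size + A.size)         # 8 trainable numbers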

For a real 4096 × 4096 matrix with rank r = 16:

Full ΔW:   4096 × 4096 = 16,777,216 parameters
LoRA:       4096 × 16 + 16 × 4096 = 131,072 parameters

That’s a 128× reduction.  0.78% of the original.

This is the LoRA decomposition. Instead of learning a full update matrix ΔW, we learn two small matrices: A (the “down-projection,” from d dimensions to r) and B (the “up-projection,” from r back to d). Their product B·A is a rank-r approximation of whatever update the model needs.

The reason this works connects to a concept called intrinsic dimensionality. Research has shown that for most downstream tasks, the useful fine-tuning updates live in a subspace of perhaps a few hundred dimensions, even though the parameter space has billions of dimensions. Think of it like steering a ship: the ship has millions of parts, but to change its course, you only need to turn the rudder. LoRA is learning the rudder, not rebuilding the hull.

I should note: this works because the pre-trained model is already good. It has learned rich, general representations during pre-training. Fine-tuning is making small adjustments to these representations for a specific task. If you were training from scratch, the updates wouldn’t be low-rank at all. The foundation has to be solid before the rudder metaphor applies.

LoRA in Practice: Initialization, Scaling, and Rank

The LoRA paper made several specific design choices that turn out to matter a lot in practice.

Initialization. Matrix A is initialized with small random values (a Gaussian in the original paper; Kaiming-style init in common implementations). Matrix B is initialized to all zeros. This means that at the start of training, B · A = 0, so the model is identical to the pre-trained base. There’s no random perturbation at step 0, no sudden change in behavior. Training starts from exactly where pre-training left off. This is a small detail that makes a big difference in training stability.

The alpha parameter. The LoRA output gets scaled by α/r before being added to the base model’s output:

output = W·x + (α/r) · B·A·x

The ratio α/r controls how aggressively the LoRA adapter modifies the base model. Common choices: α = r (scaling factor of 1, no extra amplification) or α = 2r (scaling factor of 2, more aggressive updates). Higher α means faster adaptation but more risk of instability. For most tasks, α = 2r is a solid default.
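
To make the mechanics concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer with the zero-initialized B and the α/r scaling described above (a toy version for illustration, not the peft implementation):

# A minimal LoRA-augmented linear layer
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                       # freeze the pre-trained layer
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # small random init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # zeros: B·A = 0 at step 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=16, alpha=32)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))      # 131,072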

Which layers to adapt. You don’t have to put LoRA on every weight matrix. The original paper targeted only the query and value projections in attention. In practice, targeting all linear layers (Q, K, V, O projections in attention plus gate, up, and down projections in the feed-forward network) gives the best results. The overhead is still tiny — less than 1% of total parameters even with all layers adapted.

Rank selection is where I’ve gotten burned more than once. My instinct was always to crank the rank higher “to be safe.” That’s wrong. Higher rank means more capacity, which means more risk of overfitting, especially on small datasets. Here’s the rough guide I’ve converged on after too much trial and error:

Rank 4–8:    Simple tasks. Style transfer, single-domain classification.
             Good when your task is close to what the model already does.

Rank 16–32:  The sweet spot for most fine-tuning. Instruction following,
             domain adaptation, moderate complexity. Start here.

Rank 64–128: Complex, diverse datasets. When r=32 isn’t enough and you
             have the data to justify the extra capacity.

Rank 256+:   Diminishing returns. At this point, consider whether full
             fine-tuning with DeepSpeed might be simpler.

The crucial insight about rank: if you have 1,000 training examples and you set r=64 on all layers, you will almost certainly memorize the training data. Start with r=16. Increase only when validation loss plateaus and you’ve verified you’re not overfitting.

Zero inference overhead. This is LoRA’s killer feature. After training, you compute W_merged = W + B·A and replace the original weight matrix. The merged model has exactly the same architecture as the original — no extra layers, no branching paths, no runtime overhead. It’s as if you did full fine-tuning, except you got there with 128× fewer trainable parameters.

For our three-client scenario: we train three separate LoRA adapters, each about 80 MB (for rank 16 on all layers of a 7B model). We store one base model (14 GB in FP16) and three tiny adapter files. To switch between clients, we swap the adapter. To deploy, we merge the adapter into the base and quantize. Three clients, one base model, 240 MB of adapters total.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,              # alpha/r = 2
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable: ~40M (~0.6%)  |  total: ~6.7B

QLoRA: Quantized Base, Full-Precision Adapters

LoRA brought trainable parameters down from billions to millions. But there’s still a problem: you need to load the base model into VRAM in FP16 to compute the forward pass. For a 7B model, that’s 14 GB. For a 65B model, that’s 130 GB. LoRA made the optimizer lightweight, but the base model itself is still heavy luggage.

QLoRA (Dettmers et al., 2023) has a bold idea: what if the base model was quantized to 4-bit during training, and only the LoRA adapters ran in full precision? The intuition seems dangerous — surely 4-bit base weights would produce garbage gradients for the adapters. But the result was the paper’s most surprising finding: it barely matters. QLoRA matches full 16-bit fine-tuning quality on benchmark after benchmark.

I’ll be honest, the fact that this works still surprises me. The base weights are only used to compute the forward pass and to route gradients to the adapter parameters. Apparently, 4-bit precision is enough for that routing job. The adapter parameters themselves, which are the ones actually being optimized, stay in BF16 where they get clean gradient updates.

QLoRA introduced three specific innovations that made this work:

NF4 (4-bit NormalFloat). Standard INT4 spaces its 16 quantization levels evenly across the range. But LLM weights follow a roughly Gaussian distribution — most values cluster near zero, with sparse outliers in the tails. NF4 spaces its 16 levels to match this bell curve, placing more levels near zero (where most weights are) and fewer in the tails. More levels where the data is dense means less information lost per bit.

# Conceptual: INT4 vs NF4 level spacing
#
# INT4 (uniform):     |--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
#                    -8 -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7
#                     Even spacing. Wastes levels in sparse tails.
#
# NF4 (Gaussian):    |-|--|---|------|------|---|--|-|
#                     Dense near zero        Sparse in tails
#                     Matches where weights actually live.

Double quantization. Per-group quantization stores one FP16 scale factor per 128 weights. For a 7B model, that’s about 55 million scale factors × 2 bytes = ~110 MB of overhead. Double quantization quantizes these scale factors themselves to 8-bit, cutting the overhead roughly in half. It’s quantizing the quantization constants. A small savings per parameter, but meaningful at 7 billion of them.

Paged optimizers. Even with a 4-bit base model and tiny LoRA adapters, optimizer states can occasionally spike beyond GPU memory — especially with gradient accumulation on long sequences. Paged optimizers use CUDA unified memory to transparently move optimizer states to CPU RAM when GPU memory fills up, then bring them back when needed. Think of it as virtual memory for your GPU.

Putting it together, the memory budget becomes remarkable:

Method                Weights    Gradients  Optimizer   Total
──────────────────────────────────────────────────────────────
Full fine-tune (FP16) 14 GB      14 GB      56 GB       ~84 GB
LoRA (FP16 base)      14 GB      ~0.04 GB   ~0.16 GB    ~14.2 GB
QLoRA (NF4 base)       3.5 GB    ~0.04 GB   ~0.16 GB    ~3.7 GB

# For 65B:
Full fine-tune        130 GB     130 GB     520 GB      ~780 GB
QLoRA                  ~35 GB    ~0.2 GB    ~0.8 GB     ~36 GB  ← one A100!

Our running scenario: we can now fine-tune all three client-specific adapters on a single A100, one after the other. Each run takes a few hours and costs $10–50 in cloud GPU time. That $2,400 weekly bill is a distant memory.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
import torch

# Load the base model in 4-bit NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,        # double quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# This 7B model now uses ~3.5 GB of VRAM

# Prepare for LoRA training
from peft import prepare_model_for_kbit_training, get_peft_model
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)  # same config as before

# Train with paged optimizer
training_args = TrainingArguments(
    optim="paged_adamw_8bit",    # paged optimizer for memory safety
    gradient_checkpointing=True,  # trade compute for memory
    bf16=True,
    # ... other standard training args
)

A mistake I’ve seen too often: fine-tuning with QLoRA (4-bit base), merging the adapter, then quantizing the merged model again to 4-bit for deployment. You’re quantizing an already-quantized model — double rounding. Instead: reload the base in FP16, merge the LoRA adapter at full precision, then quantize once for deployment. The adapter was trained against the 4-bit base, but the final merge should happen at full precision.
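
Sketched with the peft API, the correct merge path looks roughly like this (the adapter directory is a placeholder for whatever you saved after training):

# Merge at full precision, then quantize once
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,        # NOT the 4-bit config used during training
)
model = PeftModel.from_pretrained(base, "client-a-adapter")   # trained LoRA adapter
merged = model.merge_and_unload()                             # folds B·A into the base weights
merged.save_pretrained("client-a-merged-fp16")
# ...then run AWQ or GGUF quantization on the merged FP16 checkpoint.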

The Rest of the PEFT Family

LoRA dominates the PEFT landscape, but it’s not the only game in town. Four other methods are worth knowing, because each one teaches something about what parts of a transformer are malleable.

Adapters (Houlsby et al., 2019) were the original PEFT method. They insert small bottleneck layers — a down-projection, a nonlinearity, an up-projection — between existing transformer layers. Data flows through the adapter on every forward pass. The idea is sound, but unlike LoRA, adapters can’t be merged away after training. They add permanent inference latency. That’s why they’ve been largely superseded.

Think of adapters as building a detour in a highway. The cars still get where they need to go, but there’s always extra travel time. LoRA, by contrast, is like widening a lane that already exists — once the construction is done, traffic flows at full speed.

Prefix tuning (Li & Liang, 2021) is an entirely different idea. Instead of modifying weights, it learns a set of “virtual tokens” — continuous key-value pairs that get prepended to every attention layer. These virtual tokens steer the attention mechanism without changing any model parameters. The approach works well for generation tasks, but the virtual tokens consume context window space, which is a scarce resource in long-context applications.

Prompt tuning (Lester et al., 2021) is the minimal version of prefix tuning. It learns soft prompt embeddings that are prepended at the input layer only, not at every attention layer. Extremely parameter-efficient — sometimes fewer than 100K trainable parameters — but correspondingly less expressive. The sweet spot is when you have many tasks and want a tiny adapter per task (imagine hundreds of clients, each with a 100KB adapter file).

IA3 (Liu et al., 2022) takes minimalism further: it learns three vectors (not matrices) that rescale the key, value, and feed-forward activations. Even fewer parameters than LoRA. It’s competitive on some benchmarks, but for complex tasks that require significant behavioral change, the lack of capacity shows. I think of IA3 as adjusting the volume knobs on an existing signal, while LoRA is remixing the signal itself.

The practical takeaway, which I’ve arrived at after trying more methods than I care to admit: use LoRA. If memory is the bottleneck, use QLoRA. The ecosystem support (HuggingFace PEFT, Axolotl, LLaMA-Factory, Unsloth) is overwhelmingly focused on LoRA. The quality is near full fine-tuning. The inference overhead is zero after merging. Only reach for other methods if LoRA doesn’t satisfy a specific constraint — and know exactly which constraint that is.

Choosing Your Method

I’ve gotten this decision wrong more times than I’d like to admit. Here’s the decision tree I’ve converged on, distilled from a lot of expensive mistakes.

If you need to shrink a model for inference: quantize it. AWQ 4-bit for GPU serving, GGUF Q4_K_M for CPU/laptop. This is the first thing you should try, always. The quality loss is smaller than most people expect.

If you need to customize a model’s behavior: fine-tune it with LoRA (r=16, all linear layers). If your GPU has less than 24 GB, use QLoRA instead. Save the adapter, merge it into the base model, then quantize the merged model for deployment.

If you need both — and you almost always do — the workflow is: QLoRA fine-tune → merge at FP16 → quantize for deployment. That’s the pipeline that most production LLM teams actually use.

# The standard production pipeline:
#
# 1. Download pre-trained base model
# 2. QLoRA fine-tune on your data (r=16, NF4, all linear layers)
#    Cost: $10-50 for 7B, few hours on a single GPU
# 3. Merge LoRA adapter into base model at FP16
#    merged = model.merge_and_unload()
# 4. Quantize merged model for deployment
#    AWQ 4-bit for GPU serving (via vLLM)
#    GGUF Q4_K_M for CPU serving (via Ollama)
# 5. Serve and monitor quality

For our running scenario, the final architecture looks like this: one base LLaMA-2-7B model. Three QLoRA-trained adapters (medical, legal, support), each ~80 MB. At deployment time, each client gets the base model merged with their adapter, quantized to AWQ 4-bit (~4 GB), served through vLLM on a single GPU. Total storage for all three: ~12 GB plus 240 MB of adapters, served from a single $1/hour cloud GPU. Compare that to our starting point: three full FP16 models at 14 GB each, requiring three expensive GPU instances.

That’s the power of this stuff. It’s not theoretical. It’s not a premature optimization. It’s the difference between a product that ships and one that stays on the whiteboard.

Wrap-Up

If you’re still with me, thank you. I hope it was worth the trip.

We started with a suitcase problem: a model too large to fit on a GPU, too expensive to fine-tune, and too monolithic to customize for different clients. We learned that quantization — controlled rounding with smart scale factors — can compress a 14 GB model to 3.5 GB with surprisingly little damage. We walked through GPTQ’s error compensation, AWQ’s saliency protection, GGUF’s CPU-friendly format, and SmoothQuant’s elegant activation smoothing. Then we crossed into fine-tuning territory, where LoRA’s low-rank insight lets us customize a model by training 0.3% of its parameters, and QLoRA pushes the base model itself into 4-bit so the whole operation fits on hardware that would have been laughably inadequate for full fine-tuning.

My hope is that the next time someone mentions quantizing a model to 4-bit or fine-tuning with LoRA, instead of that instinctive wince I used to feel — treating the model like a fragile exhibit — you’ll see these techniques for what they are: precise, well-understood engineering that makes the difference between a model that runs in production and one that exists only in research papers. The symphony doesn’t become a ringtone. It becomes a well-compressed FLAC file that fits on your phone and sounds nearly identical.

Resources