Loss Functions

Chapter 7: Deep Learning Foundations · Section 3 of 7

The Confession

I used the same loss function for everything for an embarrassingly long time. Regression? MSE. Classification? Cross-entropy. I didn't think about why. I copied what the tutorial used, the loss went down during training, and I moved on. It was like cooking with only salt — technically seasoning, but missing the entire spice rack.

Then I ran into a dataset with a handful of corrupted labels: sensor readings where someone had entered values in the wrong units. My model's predictions were wild. The loss was enormous, and the model was bending itself into knots trying to accommodate five garbage points out of ten thousand. That was when I realized: the loss function isn't a detail. It's the entire definition of what your model is trying to become. Change the loss, change the model. This section is the dive I should have taken back then.

A loss function takes a prediction and a ground truth and outputs a single number: how wrong was this prediction? The optimizer's entire job is to make that number smaller. That's the training loop. But here's what that innocent-sounding definition hides — the loss function encodes a complete philosophy about what "wrong" means. Two identical architectures, same data, same optimizer. Train one with squared error, the other with absolute error. You get different models. Not from randomness. From a fundamental disagreement about how to weigh mistakes.

Before we start, a heads-up. We'll be touching on probability distributions, information theory, and some calculus (gradients). You don't need to know any of it beforehand. We'll build every concept from scratch, one piece at a time, with explanations along the way.

This isn't a short journey, but I hope you'll be glad you came.

The Map

The landscape metaphor
A tiny prediction problem
Mean Squared Error — the judge who hates big mistakes
The Gaussian secret hiding inside MSE
Mean Absolute Error — the stoic judge
Huber Loss — the diplomat
Rest stop
Crossing into probability — binary cross-entropy
Why logarithm? The information surprise
Categorical cross-entropy and KL divergence
The numerical landmine and the logits trick
Focal loss — when easy examples drown the signal
Beyond the basics — contrastive, triplet, and quantile losses
The decision rule
PyTorch patterns that keep you safe

The Landscape Metaphor

Picture the loss function as a landscape — a vast terrain of mountains, valleys, and ridges. Every possible configuration of your model's parameters is a specific location on this terrain, and the altitude at that location is the loss value. Training is the optimizer hiking downhill, searching for the lowest valley.

Here's the thing that took me a while to internalize: the loss function doesn't just measure the terrain. It creates it. Swap MSE for MAE and the entire geography changes. What was a smooth bowl becomes a terrain with sharp creases. What was a single deep valley might split into a ridge with two paths. The optimizer is the same hiker either way — but the mountains it has to navigate are sculpted entirely by your choice of loss.

We'll keep coming back to this landscape throughout. When I say a loss "has smooth gradients near the optimum," I mean the valley floor slopes gently — the hiker can take small, precise steps to reach the exact bottom. When I say a gradient is "constant regardless of error size," I mean every part of the terrain has the same steepness — the hiker charges forward at full speed even when it's inches from the valley floor, overshooting and oscillating.

A Tiny Prediction Problem

We need a running example to make all of this concrete. Imagine we're building the world's smallest house price predictor. Three houses, three predictions. Our model looked at each house and guessed a price:

House A:  true price = $200k,  predicted = $210k,  error = +$10k
House B:  true price = $350k,  predicted = $340k,  error = -$10k
House C:  true price = $180k,  predicted = $400k,  error = +$220k

Houses A and B have reasonable errors — off by $10k in opposite directions. House C is a disaster. Maybe the model confused a cottage with a mansion, or maybe the label is wrong — someone entered $180k when they meant $380k. Either way, that $220k error is sitting there, and how the loss function handles it will determine what kind of model we end up with.

Every loss function we discuss will be applied to these same three houses. Watch how each one reacts to that outlier in House C. It reveals their personality.

Mean Squared Error — The Judge Who Hates Big Mistakes

MSE is the loss function most people meet first. You take each error, square it, and average the results.

MSE = (1/n) × Σ(yᵢ - ŷᵢ)²

Let's apply it to our three houses.

House A:  (10)²   =    100
House B:  (-10)²  =    100
House C:  (220)²  = 48,400

MSE = (100 + 100 + 48,400) / 3 = 16,200

Look at that. Houses A and B contribute 200 total. House C contributes 48,400. One sample out of three is responsible for 99.6% of the loss. If the optimizer can only reduce one error, it will pour everything into fixing House C, even if that means making A and B slightly worse. That's the squaring at work. An error of 220 isn't 22 times worse than an error of 10 — it's 484 times worse. MSE is screaming at the model to fix the big mistakes first.

When your data is clean and your errors are roughly symmetric — small errors concentrated near zero, large errors rare — this is exactly what you want. The model aggressively cleans up its worst predictions and converges fast.

The gradient tells the same story. The partial derivative of the squared error with respect to the prediction is 2(ŷ - y) — linear in the error. Bigger errors produce bigger gradients, which produce bigger parameter updates. It's a feedback loop that works beautifully when the big errors are genuine mistakes the model should fix, and catastrophically when they're corrupted data the model should ignore.
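
You can watch this happen directly with a few lines of autograd. A minimal sketch, using the same three houses (values in $k; the variable names are mine):

import torch

y_true = torch.tensor([200.0, 350.0, 180.0])
y_pred = torch.tensor([210.0, 340.0, 400.0], requires_grad=True)

loss = ((y_pred - y_true) ** 2).mean()
loss.backward()

print(loss.item())   # 16200.0: matches the hand computation
print(y_pred.grad)   # tensor([6.6667, -6.6667, 146.6667]): 2(ŷ - y)/n per house

House C's gradient is 22 times House A's, exactly linear in the error. The parameter update this batch produces is almost entirely about House C.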

The Gaussian Secret Hiding Inside MSE

Here's the part that most tutorials mention as a footnote but that fundamentally changes how you think about loss functions. Minimizing MSE is mathematically identical to performing maximum likelihood estimation under the assumption that your errors follow a Gaussian (normal) distribution.

The setup: assume the true relationship is y = f(x) + ε, where ε is random noise drawn from a Gaussian distribution N(0, σ²). The probability of observing a particular target y given the model's prediction ŷ is:

P(y | ŷ) = (1 / √(2πσ²)) × exp(-(y - ŷ)² / (2σ²))

For all our data points, assuming independence, the total probability — the likelihood — is the product of these individual probabilities. We want to find the predictions that make the observed data most probable. Taking the logarithm (because products become sums, and maximization is easier):

log P(data | model) = -1/(2σ²) × Σ(yᵢ - ŷᵢ)² + constant

Maximizing this is identical to minimizing Σ(yᵢ - ŷᵢ)². That's MSE.
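
If you'd rather see the identity numerically than algebraically, here's a quick check, as a sketch with σ fixed at 1 (torch.distributions handles the Gaussian bookkeeping):

import torch
from torch.distributions import Normal

y    = torch.tensor([200.0, 350.0, 180.0])
pred = torch.tensor([210.0, 340.0, 400.0])
sigma = 1.0

# Gaussian negative log-likelihood of the targets given the predictions
nll = -Normal(pred, sigma).log_prob(y).sum()

# Scaled sum of squared errors plus the constant term from the derivation
sse   = ((y - pred) ** 2).sum() / (2 * sigma ** 2)
const = 0.5 * len(y) * torch.log(torch.tensor(2 * torch.pi * sigma ** 2))

print(torch.allclose(nll, sse + const))  # True: NLL and SSE differ by a constant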

So every time you use MSE, you're implicitly telling the model: "I believe the noise in my data is Gaussian — symmetric, concentrated near zero, with light tails." If that belief is roughly correct, MSE is the statistically optimal choice. If it's wrong — if your errors have heavy tails, or if some labels are corrupted — you're optimizing for a fantasy, and the model will oblige by contorting itself to match that fantasy.

I'll be honest — I used MSE for years without understanding this connection. Once it clicked, my relationship with loss functions changed entirely. The loss isn't a measuring tool. It's a statistical assumption.

Mean Absolute Error — The Stoic Judge

MAE takes a fundamentally different philosophical stance. Instead of squaring errors, it takes their absolute value.

MAE = (1/n) × Σ|yᵢ - ŷᵢ|

Same three houses:

House A:  |10|   =  10
House B:  |-10|  =  10
House C:  |220|  = 220

MAE = (10 + 10 + 220) / 3 = 80

House C still dominates, but proportionally? It contributes 220 out of 240 — about 92% of the loss. Compare that to MSE's 99.6%. MAE is less impressed by the outlier. An error of 220 is 22 times worse than an error of 10 — exactly 22 times, no amplification. The outlier can't hijack the entire gradient signal.

The statistical connection mirrors MSE perfectly. Minimizing MAE is equivalent to maximum likelihood under a Laplace distribution — a distribution that looks like a Gaussian but with heavier tails and a sharper peak. Where the Gaussian drops off quickly (making large errors very unlikely), the Laplace distribution allows them more generously. If your data has occasional wild errors — sensor glitches, labeling mistakes, fat-tailed noise — the Laplace assumption is closer to reality, and MAE becomes the optimal choice.

Here's an insight that connected a lot of dots for me: MSE fits the mean of the conditional distribution y|x. MAE fits the median. For symmetric distributions, those are the same point. But for skewed or heavy-tailed distributions, the median is more robust — it doesn't get dragged around by extreme values. MAE gives you that robustness automatically, without you having to think about it.

The tradeoff is in the gradients. The derivative of |x| is a step function: +1 when x is positive, -1 when x is negative, undefined at exactly zero. The gradient doesn't care how close you are to the correct answer. Whether the model is off by $100k or $0.001, it receives the same magnitude of gradient. In our landscape metaphor, the terrain has constant slope everywhere — the hiker charges downhill at full speed and then overshoots the valley floor, turns around, charges back, overshoots again. Convergence near the optimum gets noisy and imprecise. Optimizers like Adam smooth this out somewhat, but MSE's landscape is fundamentally more cooperative near the bottom of the valley.
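
The same autograd experiment as before makes the contrast visible; only the loss line changes (a sketch, same three houses):

import torch

y_true = torch.tensor([200.0, 350.0, 180.0])
y_pred = torch.tensor([210.0, 340.0, 400.0], requires_grad=True)

loss = (y_pred - y_true).abs().mean()
loss.backward()

print(y_pred.grad)  # tensor([0.3333, -0.3333, 0.3333]): sign only, +/- 1/n
# House C's error is 22x House A's, but its gradient has identical magnitude.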

Huber Loss — The Diplomat

Once you understand the MSE-MAE tradeoff, the natural question is: can we get the best of both? Smooth, precise convergence near the optimum (MSE's strength), but resistance to outliers (MAE's strength)?

Peter Huber asked exactly this question in 1964 and the answer bears his name. Huber loss is quadratic for small errors and linear for large ones, with a parameter δ controlling the crossover point:

Huber(y, ŷ) =
  0.5 × (y - ŷ)²             if |y - ŷ| ≤ δ
  δ × (|y - ŷ| - 0.5 × δ)    if |y - ŷ| > δ

With δ = 15 (in thousands), our three houses:

House A:  |10| ≤ 15  → 0.5 × 10²  = 50       (quadratic zone)
House B:  |10| ≤ 15  → 0.5 × 10²  = 50       (quadratic zone)
House C:  |220| > 15 → 15 × (220 - 7.5) = 3,187.5  (linear zone)

Compare that to MSE's 48,400 for House C alone. The outlier's contribution dropped by 93%. But Houses A and B still get the smooth quadratic treatment, so convergence near the optimum stays precise.
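
A direct translation of the piecewise definition reproduces the hand computation. A sketch (the helper name is mine; PyTorch's F.huber_loss does the same thing with reduction built in):

import torch

def huber(y, y_hat, delta):
    err = (y - y_hat).abs()
    quadratic = 0.5 * err ** 2              # inside the delta zone
    linear = delta * (err - 0.5 * delta)    # outside: linear growth
    return torch.where(err <= delta, quadratic, linear)

y_true = torch.tensor([200.0, 350.0, 180.0])
y_pred = torch.tensor([210.0, 340.0, 400.0])

print(huber(y_true, y_pred, delta=15.0))  # tensor([50.0, 50.0, 3187.5])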

In the landscape metaphor, Huber loss creates a terrain that's a smooth bowl near the valley floor (like MSE) but flattens into gentle constant-slope planes far from the center (like MAE). The hiker gets precise footing where it matters and doesn't get dragged off by distant distortions.

The δ parameter encodes your definition of "what counts as an outlier." Too small and everything is an outlier — you're using MAE with extra steps. Too large and nothing is — you're using MSE with a wasted hyperparameter. A practical starting point: set δ to the median absolute deviation of your residuals, then tune. PyTorch's nn.HuberLoss(delta=1.0) works well when your targets are standardized to have unit variance. In reinforcement learning (DQN in particular), Huber loss — also known as smooth L1 loss — is the default because TD-error targets are inherently noisy.

I still reach for Huber any time I suspect my data might contain a few bad labels but don't want to sacrifice the convergence benefits of MSE. It's the diplomat of regression losses — agreeable with everyone, offensive to no one.

Rest Stop

Congratulations on making it this far. You can stop here if you want.

You now have a mental model for the three regression losses: MSE (aggressive against big errors, assumes Gaussian noise, estimates the mean), MAE (stoic, assumes Laplacian noise, estimates the median), and Huber (diplomat, quadratic near zero, linear far away, controlled by δ). You know each one creates a different optimization landscape, and you can predict how each reacts to outliers without running a single experiment.

That's a solid foundation, and it covers the majority of regression problems you'll encounter. The short version for anyone in a hurry: use MSE for clean data, Huber when outliers worry you, MAE when you want the median. There — you're 60% of the way through loss functions.

But classification losses are a different beast. They stop measuring distances and start measuring probabilities, and the reasoning shifts from geometry to information theory. If the discomfort of not knowing what's underneath cross-entropy is nagging at you, read on.

Crossing into Probability — Binary Cross-Entropy

We need a new running example. Our house price predictor outputs a number. A classifier outputs a probability. Different world, different loss.

Imagine we're building a model that looks at an email and predicts the probability it's spam. Three emails:

Email 1:  actually spam,    model says 95% spam   → confident and correct
Email 2:  actually not spam, model says 10% spam   → confident and correct
Email 3:  actually spam,    model says 3% spam     → confident and WRONG

How do we score this? We can't use MSE on probabilities — it wouldn't punish Email 3 nearly enough. The model said "3% chance this is spam" and it was spam. That's not a small mistake. That's a catastrophic failure of confidence.

Think about it from a gambling perspective. Before seeing the answer, the model places a bet on each email. For Email 1, it bet 95% on spam, and spam was correct — it assigned 0.95 probability to what actually happened. For Email 3, it bet 3% on spam, and spam was correct — it assigned only 0.03 to what actually happened. We want a loss that rewards high bets on correct outcomes and destroys the model for confident wrong bets.

Define pₜ as the probability the model assigned to whatever actually happened:

Email 1:  pₜ = 0.95   (assigned 95% to the correct outcome: spam)
Email 2:  pₜ = 0.90   (assigned 90% to the correct outcome: not spam)
Email 3:  pₜ = 0.03   (assigned 3% to the correct outcome: spam)

A good model makes pₜ large for every sample. We want to maximize it. But maximizing products of probabilities is numerically treacherous (numbers get tiny fast), so we take the negative logarithm and minimize instead:

Email 1:  -log(0.95) = 0.05    barely any loss
Email 2:  -log(0.90) = 0.11    small loss
Email 3:  -log(0.03) = 3.51    enormous loss

Email 3's loss is 70 times larger than Email 1's. Not because the model was wrong — it could have said 50% and been wrong — but because it was confident and wrong. It said "almost certainly not spam" about a message that was spam. The log creates exactly the penalty structure we need.

Written out in the standard form where y ∈ {0, 1} and p is the model's predicted probability of class 1:

BCE = -[y · log(p) + (1-y) · log(1-p)]

When y = 1, the first term is active: -log(p). When y = 0, the second term is active: -log(1-p). Either way, we're computing -log(pₜ) — the negative log of the probability assigned to the truth. That's binary cross-entropy.
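
Here's the same scoring in code, as a sketch; reduction="none" exposes each email's loss individually. (Probabilities are passed directly here purely for illustration; in real training you'd pass logits, as a later section explains.)

import torch
import torch.nn.functional as F

labels = torch.tensor([1.0, 0.0, 1.0])      # spam, not spam, spam
probs  = torch.tensor([0.95, 0.10, 0.03])   # model's predicted P(spam)

losses = F.binary_cross_entropy(probs, labels, reduction="none")
print(losses)  # tensor([0.0513, 0.1054, 3.5066]): Email 3 dominates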

Why Logarithm? The Information Surprise

I'll be honest — the log in cross-entropy confused me for a long time. It seemed like an arbitrary mathematical convenience. It isn't. The logarithm is doing something profound.

In information theory, -log(p) measures surprise — how shocked you should be when an event with probability p actually occurs. High probability event happens? Not surprised: -log(0.99) ≈ 0.01. Coin flip? Moderately surprised: -log(0.5) ≈ 0.69. Something you thought was nearly impossible happens? Catastrophically surprised: -log(0.01) ≈ 4.6.

The surprise function has two critical properties. First, it's zero when you're perfectly certain and correct (p = 1 → -log(1) = 0). Second, it goes to infinity as your confidence in the wrong answer goes to infinity (p → 0 → -log(p) → ∞). That means there's no ceiling on how badly the model can be punished for confident wrong predictions.

Look at the asymmetry carefully. Improving from p = 0.95 to p = 0.99 reduces the loss from 0.05 to 0.01 — a savings of 0.04. Improving from p = 0.01 to p = 0.05 reduces the loss from 4.6 to 3.0 — a savings of 1.6. The model gets 40 times more reward for rescuing a terrible prediction than for polishing an already-good one. This is exactly right. A classifier that says "98% class A" when the answer is B should be punished far more than one that says "55% class A."

The log also gives us something mathematically elegant: it converts the product of probabilities (likelihood) into a sum (log-likelihood). Products of many small numbers underflow to zero in floating-point arithmetic. Sums of their logarithms stay manageable. This is why we can train on batches of thousands of samples without our loss collapsing to numerical noise.

The Likelihood Connection

Cross-entropy and maximum likelihood are the same thing wearing different hats. If the model predicts P(y=1|x) = p for each sample, the likelihood of all observed labels is:

L = Π pᵢ^yᵢ × (1-pᵢ)^(1-yᵢ)

Taking the negative log:

-log L = -Σ [yᵢ · log(pᵢ) + (1-yᵢ) · log(1-pᵢ)]

That's the sum of binary cross-entropies across all samples. Minimizing BCE is maximum likelihood estimation. Not analogous to it. Not inspired by it. It is it. Every time you train a classifier with cross-entropy, you're fitting a Bernoulli distribution to each data point and finding the parameters that make the observed labels most probable. That identity connects loss functions to the entire machinery of statistical inference — confidence intervals, hypothesis tests, model comparison — all of it becomes available once you see the loss through the likelihood lens.

Categorical Cross-Entropy and KL Divergence

Binary cross-entropy handles two classes. What about K classes? The same principle extends naturally. The true label becomes a one-hot vector (one class is 1, the rest are 0). The model outputs a probability distribution over all K classes via softmax. The loss is:

CE = -Σₖ yₖ · log(pₖ)

Since y is one-hot, all terms vanish except the one corresponding to the true class. The loss collapses to -log(p_correct) — the negative log of whatever probability the model assigned to the right answer. Our spam classifier generalizes to a ten-class email categorizer: if the model puts 0.9 probability on the correct category, loss is 0.105. If it puts 0.01, loss is 4.605.
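
In code, the collapse is almost anticlimactic. A sketch with a hypothetical ten-class softmax output:

import torch

# Hypothetical softmax output over ten email categories
probs = torch.tensor([0.9, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])
true_class = 0

# The one-hot target zeroes out every term except the true class
loss = -torch.log(probs[true_class])
print(loss.item())  # 0.1054 = -log(0.9)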

There's a deeper result here that connects cross-entropy to a foundational concept in information theory. Cross-entropy between the true distribution P and the model's distribution Q decomposes as:

H(P, Q) = H(P) + D_KL(P ‖ Q)

H(P) is the entropy of the true distribution — for hard (one-hot) labels, this is zero. D_KL(P ‖ Q) is the Kullback-Leibler divergence, which measures how different Q is from P in information-theoretic terms. Since H(P) is constant with respect to the model parameters, minimizing cross-entropy is identical to minimizing KL divergence. You're making the model's probability distribution as close as possible to the true distribution, measured by the most natural distance that information theory provides.

This chain of equalities is worth internalizing: minimizing cross-entropy = maximizing likelihood = minimizing KL divergence. Three frameworks, one operation. It doesn't matter which lens you use to think about it — the gradients are the same, the optimum is the same. But having all three perspectives helps you reason about why certain things work.

Sparse vs Dense Labels

Same math, different input format. Dense cross-entropy expects one-hot labels: [0, 0, 1, 0, 0]. Sparse expects integer indices: 2. The computation is identical — both look up -log(p_correct). Use sparse when you have many classes. ImageNet has 1,000 classes — storing 999 zeros per sample wastes memory for no reason. PyTorch's nn.CrossEntropyLoss takes integer labels by default. TensorFlow makes the distinction explicit: CategoricalCrossentropy vs SparseCategoricalCrossentropy.
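
A quick sketch showing the two formats produce the same number in PyTorch (probability-style targets require v1.10 or later):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 5)
idx    = torch.tensor([2, 0, 4, 1])                 # sparse: integer indices
onehot = F.one_hot(idx, num_classes=5).float()      # dense: one-hot rows

print(torch.allclose(F.cross_entropy(logits, idx),
                     F.cross_entropy(logits, onehot)))  # True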

The Numerical Landmine and the Logits Trick

There's a practical trap that has bitten every deep learning practitioner at least once. The log in cross-entropy means log(0) = -∞. If your model ever predicts exactly p = 0 or p = 1 — which sigmoid or softmax can approach asymptotically but which floating-point arithmetic can reach through rounding — the loss becomes infinite and training explodes into NaN.

The naive fix is clamping predictions away from 0 and 1:

eps = 1e-7
pred = torch.clamp(pred, eps, 1 - eps)
loss = -(y * torch.log(pred) + (1 - y) * torch.log(1 - pred))

This works but introduces a subtle problem: you've distorted the gradients near the boundaries, and you're still computing sigmoid and log as separate floating-point operations, which loses precision at extremes.

The proper solution is to never compute the activation and the log separately. PyTorch's F.binary_cross_entropy_with_logits takes raw logits — the values before sigmoid — and fuses sigmoid + log into a single numerically stable operation using the identity:

-log(sigmoid(z)) = log(1 + e⁻ᶻ) = softplus(-z)

This never computes sigmoid explicitly, so it never produces the 0.0 or 1.0 that would blow up the log. The same principle applies to multi-class: nn.CrossEntropyLoss takes raw logits and computes softmax + log internally via the log-sum-exp trick.

# CORRECT: always pass raw logits
loss_bce = F.binary_cross_entropy_with_logits(logits, labels)
loss_ce  = F.cross_entropy(logits, integer_labels)

# DANGEROUS: separate activation then loss
probs = torch.sigmoid(logits)
loss = F.binary_cross_entropy(probs, labels)  # can produce NaN

The Rule

If you ever find yourself writing torch.sigmoid() or F.softmax() followed by a separate loss computation, stop and refactor. The _with_logits variants and CrossEntropyLoss exist precisely because the fused computation is not optional — it's the only way to handle extreme logits without numerical catastrophe. An input logit of 100 would produce sigmoid(100) ≈ 1.0 in float32, and then log(1 - 1.0) = log(0) = -∞. The fused version handles this cleanly.
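
You can verify the failure mode in two lines. A sketch:

import torch
import torch.nn.functional as F

z = torch.tensor([100.0])

naive = -torch.log(1 - torch.sigmoid(z))   # sigmoid(100) rounds to 1.0 in float32
fused = F.binary_cross_entropy_with_logits(z, torch.tensor([0.0]))

print(naive.item(), fused.item())  # inf 100.0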

Focal Loss — When Easy Examples Drown the Signal

Standard cross-entropy treats every sample equally. For balanced datasets, that's fine. But imagine training an object detector. In every image, the model evaluates thousands of possible regions. Maybe 10 contain an actual object. The other 9,990 are background. Each background region is an easy negative — the model quickly learns to say "no object here" with 99.9% confidence. But those easy negatives still contribute to the total loss. Ten thousand easy predictions, each contributing a tiny amount, add up to a massive signal that drowns the few hard positives the model actually needs to learn from.

The model spends its gradient budget polishing already-correct predictions instead of wrestling with the hard cases. It's like a student who keeps re-reading chapters they already understand because it feels productive.

Focal loss, introduced in the RetinaNet paper (Lin et al., 2017), fixes this with a modulating factor that downweights easy examples:

FL(pₜ) = -αₜ × (1 - pₜ)ᵞ × log(pₜ)

The (1 - pₜ)ᵞ factor does the work. When pₜ is close to 1 (the model is confident and correct), (1 - pₜ) is close to 0, and raising it to the power γ makes it even smaller. That easy example contributes almost nothing to the loss. When pₜ is close to 0 (the model is wrong), the factor is close to 1, and the loss is essentially standard cross-entropy.

Let's see the numbers. With γ = 2 (the value from the original paper):

Easy example:  pₜ = 0.95 → (1 - 0.95)² × (-log 0.95) = 0.0025 × 0.05 = 0.000125
Hard example:  pₜ = 0.10 → (1 - 0.10)² × (-log 0.10) = 0.81 × 2.30   = 1.863

The easy example's contribution shrinks by a factor of 400 compared to standard cross-entropy. The hard example barely changes. The gradient signal is now dominated by the samples the model actually needs to learn from.

The αₜ parameter handles class frequency separately — it's a per-class weight, like weighted cross-entropy. The γ parameter handles prediction difficulty — it's what makes focal loss genuinely different from weighted cross-entropy. Weighted CE adjusts for how often a class appears. Focal loss adjusts for how confidently the model handles each individual sample. They solve different problems, and focal loss is the one you need when the issue is overwhelming easy negatives, not raw class counts.
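
Here's a minimal binary focal loss, a sketch following the formula from Lin et al. (2017); the function name and defaults are mine, and torchvision ships a ready-made sigmoid_focal_loss if you'd rather not roll your own:

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Standard BCE gives -log(p_t); the modulating factor downweights easy cases
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # per-class weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()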

I still need to look up whether γ should be 2 or 5 for a particular application. The original paper found γ = 2 worked best for object detection. At γ = 0, focal loss degenerates to standard cross-entropy. At very high γ, even moderately difficult examples get suppressed, which can hurt if your dataset has a gradual spectrum of difficulty rather than a clean easy/hard split.

The RetinaNet result was striking: a single-stage detector — architecturally simpler than the two-stage models it competed against — beat every existing architecture. The bottleneck wasn't the model. It was the loss function. The model had always been capable; the loss was preventing it from learning.

Rest Stop

If you've made it here, you now have the complete toolkit for the losses you'll use 95% of the time: MSE, MAE, and Huber for regression; binary and categorical cross-entropy for classification; focal loss for imbalanced classification. You understand the statistical assumptions each one makes, the optimization landscapes they create, and the numerical traps to avoid.

The short version for anyone ready to stop: use MSE for regression with clean data, Huber for regression with outliers, cross-entropy with logits for classification, focal loss when class imbalance is extreme. That covers most production ML.

But loss functions go further. There are losses designed not to measure "how wrong is this prediction" but "how well do these embeddings organize the world." If the idea of loss functions that don't compare a prediction to a label — but instead compare samples to each other — sounds intriguing, read on.

Beyond the Basics — Contrastive, Triplet, and Quantile Losses

Everything we've covered so far compares a model's output to a ground truth label. Contrastive and triplet losses break that pattern entirely. Instead of asking "how close is this prediction to the right answer?" they ask "are similar things close together and dissimilar things far apart in the model's internal representation?"

Contrastive Loss

Contrastive loss operates on pairs. Given two inputs, it either pulls their embeddings together (if they're from the same class) or pushes them apart (if they're different). A margin parameter m defines the minimum distance we want between dissimilar pairs:

L = (1-y) × ½D² + y × ½max(0, m - D)²

When the pair is similar (y = 0), the loss penalizes any distance between them — it wants D = 0. When the pair is dissimilar (y = 1), the loss only activates if D < m — if they're already far enough apart, there's nothing to optimize. The margin m is the border fence: once the dissimilar pair is on opposite sides of the fence, the loss stops caring. This was the foundation for Siamese networks in face verification — same person, pull together; different person, push apart.
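
A sketch of the pairwise version, using the convention above (y = 0 for similar pairs, y = 1 for dissimilar; the function name is mine):

import torch

def contrastive_loss(z1, z2, y, margin=1.0):
    d = torch.norm(z1 - z2, dim=1)                 # euclidean distance per pair
    pull = (1 - y) * 0.5 * d ** 2                  # similar pairs: close the gap
    push = y * 0.5 * torch.clamp(margin - d, min=0) ** 2  # dissimilar: past the fence
    return (pull + push).mean()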

Triplet Loss

Triplet loss works with groups of three: an anchor, a positive (same class as anchor), and a negative (different class). The loss ensures the anchor-positive distance is smaller than the anchor-negative distance by at least a margin m:

L = max(0, D(anchor, positive) - D(anchor, negative) + m)

If the negative is already far enough away, the loss is zero — no gradient, nothing to update. This is efficient when it works, but there's a catch: with random sampling, most triplets are "easy" — the negative is already well-separated, so the gradient is zero for most of the batch. Hard negative mining — deliberately selecting negatives that are close to the anchor — becomes essential. FaceNet, the landmark face recognition system from Google, ran entirely on triplet loss with careful hard negative mining. Without it, training stalled.
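
PyTorch ships this one directly. A sketch with random embeddings standing in for a real encoder's output:

import torch
import torch.nn.functional as F

anchor   = torch.randn(32, 128)   # embeddings from the encoder
positive = torch.randn(32, 128)   # same class as anchor
negative = torch.randn(32, 128)   # different class

loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)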

InfoNCE and Modern Contrastive Learning

The self-supervised learning revolution (SimCLR, MoCo, CLIP) runs on InfoNCE loss, which generalizes contrastive loss to many negatives at once:

L = -log( exp(sim(z, z⁺)/τ) / (exp(sim(z, z⁺)/τ) + Σᵢ exp(sim(z, zᵢ⁻)/τ)) )

It's a softmax over similarities: the numerator is the similarity to the positive, and the denominator sums over the positive and all the negatives. The temperature τ controls how sharply the distribution concentrates on the most similar items. Lower τ makes the loss pickier — it demands clearer separation. Higher τ is more forgiving. This loss function is the reason CLIP can match images to text descriptions and SimCLR can learn visual features without any labels at all.
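
A minimal InfoNCE sketch for the in-batch setup, where row i of z_a and row i of z_b are two views of the same sample and every other row serves as a negative (the function name is mine):

import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, tau=0.1):
    z_a = F.normalize(z_a, dim=1)            # cosine similarity via unit vectors
    z_b = F.normalize(z_b, dim=1)
    sim = z_a @ z_b.T / tau                  # pairwise similarities / temperature
    targets = torch.arange(z_a.size(0))      # positives sit on the diagonal
    return F.cross_entropy(sim, targets)     # softmax over each row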

Quantile (Pinball) Loss

Back in regression territory, but with a twist. MSE and MAE are symmetric — overpredicting by 10 is the same as underpredicting by 10. Quantile loss (also called pinball loss) breaks that symmetry deliberately. For a target quantile q:

L_q(y, ŷ) = q × (y - ŷ)      if y > ŷ   (underprediction)
           = (1-q) × (ŷ - y)  if y ≤ ŷ   (overprediction)

At q = 0.5, this is MAE — symmetric penalty. At q = 0.9, underprediction is penalized 9 times more than overprediction — the model learns to predict the 90th percentile. By training separate models at q = 0.1 and q = 0.9, you get prediction intervals: an 80% confidence band around your forecast. This is how modern demand forecasting, energy prediction, and financial risk models produce uncertainty estimates rather than point predictions.
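
The implementation is two lines, as a sketch; the max trick folds the two branches of the piecewise definition into one expression:

import torch

def quantile_loss(y, y_hat, q):
    err = y - y_hat
    # err > 0: the first term wins (q * err); err <= 0: the second ((q-1) * err)
    return torch.max(q * err, (q - 1) * err).mean()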

A Few More Worth Knowing

Hinge loss — max(0, 1 - y · f(x)) — is the loss behind Support Vector Machines. It doesn't care about correct predictions that are already confident (loss is zero when y · f(x) ≥ 1), focusing all attention on the margin boundary. It creates the max-margin classifier.

Label smoothing replaces hard one-hot targets [0, 0, 1, 0] with softened versions [0.025, 0.025, 0.925, 0.025]. It's a regularization trick that prevents the model from becoming pathologically confident. Cross-entropy with hard labels pushes logits toward ±∞ — label smoothing caps this drive and improves calibration.
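
PyTorch has this built in (v1.10+); with four classes, ε = 0.1 reproduces exactly the softened targets above:

import torch
import torch.nn.functional as F

logits  = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))

loss = F.cross_entropy(logits, targets, label_smoothing=0.1)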

Knowledge distillation loss combines standard cross-entropy (student vs. true labels) with KL divergence (student vs. teacher's soft predictions). The teacher's softened outputs carry "dark knowledge" — relationships between classes that hard labels discard. A teacher that gives 30% dog, 25% wolf, 45% cat is saying something about dog-wolf similarity that a hard label of "cat" never could.
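
A common formulation, as a sketch; the temperature T and mixing weight lam are hyperparameters (names are mine), and the T² factor keeps the soft-target gradients on the same scale as the hard ones:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, lam=0.5):
    hard = F.cross_entropy(student_logits, labels)          # student vs. true labels
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T          # student vs. teacher
    return lam * hard + (1 - lam) * soft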

The Decision Rule

After all of that, here's the practical truth: for 95% of problems, you need one of five losses. The decision takes ten seconds:

Regression, clean data         → MSE
Regression, outliers likely    → Huber (or MAE)
Binary classification          → BCE with logits
Multi-class classification     → Cross-entropy with logits
Extreme class imbalance        → Focal loss

Embeddings and similarity learning use contrastive or triplet loss (or InfoNCE for self-supervised pretraining). Quantile loss when you need prediction intervals. Hinge loss when you're building an SVM.

If you're reaching for something exotic, make sure you've exhausted the basics first. The loss function is rarely the bottleneck — data quality, model capacity, and regularization matter more in most pipelines. But when the loss is wrong, nothing else can compensate. A model optimizing the wrong objective will get better and better at something you don't care about.

PyTorch Patterns That Keep You Safe

import torch
import torch.nn.functional as F

# ── Our house prices: same three houses ──────────────────
y_true = torch.tensor([200.0, 350.0, 180.0])
y_pred = torch.tensor([210.0, 340.0, 400.0])

mse   = F.mse_loss(y_pred, y_true)                    # 16,200.0 — outlier dominates
mae   = F.l1_loss(y_pred, y_true)                      #     80.0 — outlier tamed
huber = F.huber_loss(y_pred, y_true, delta=15.0)       #  1,095.8 — compromise

# ── Binary classification: ALWAYS pass raw logits ────────
labels = torch.tensor([1.0, 0.0, 1.0])
logits = torch.tensor([3.0, -2.5, -3.5])  # raw scores BEFORE sigmoid

loss_bce = F.binary_cross_entropy_with_logits(logits, labels)

# ── Multi-class: CrossEntropyLoss expects raw logits + integer labels ──
logits_mc  = torch.randn(4, 10)
targets_mc = torch.tensor([3, 7, 0, 2])

loss_ce = F.cross_entropy(logits_mc, targets_mc)

# ── Extreme logit: this is why fused operations matter ───
z = torch.tensor([100.0])
safe = F.binary_cross_entropy_with_logits(z, torch.tensor([0.0]))
# Returns 100.0 — correct, no NaN

The code above captures the three things that matter. Regression losses applied to the same data showing how each reacts to the outlier. Binary and multi-class classification always receiving raw logits, never post-activation probabilities. And the extreme logit test proving the fused operation handles what the separate computation can't.

Wrapping Up

If you're still with me, thank you. I hope it was worth it.

We started with three houses and a question — what does "wrong" mean? — and built our way through the statistical assumptions hiding inside MSE (Gaussian noise, mean estimation), the stoic robustness of MAE (Laplacian noise, median estimation), and the diplomatic Huber loss that bridges them. We crossed into probability-land and saw why the logarithm isn't arbitrary — it's information-theoretic surprise, and it's why confident wrong predictions get destroyed. We traced the identity from cross-entropy through maximum likelihood to KL divergence and saw they're the same operation from three angles. We saw focal loss rescue a model that was drowning in easy examples, and we glimpsed the world of losses that compare samples to each other rather than to labels.

My hope is that the next time you start a training run and reach for a loss function, instead of copying whatever the tutorial used, you'll pause for a moment and ask: what am I telling this model about how to define "wrong"? Because the model will faithfully optimize whatever you hand it. It's your job to hand it the right thing.
