Nice to Know
I'll be honest — I spent a long time thinking supervised learning meant "classification or regression, pick one." Linear regression for continuous targets, logistic regression for binary ones, maybe a random forest if I was feeling adventurous. And for a while, that was enough.
Then I ran into a dataset of insurance claims where the target was a count of incidents per year. I tried linear regression. It predicted negative claims for some customers. That's not a thing. I tried classification. It threw away all the information about how many claims. I was stuck in a gap between the two tools I knew, and I stayed stuck until someone mentioned the phrase "generalized linear model" and my whole mental framework cracked open.
This section covers the supervised learning topics that live in that gap — the techniques that solve specific, real problems that plain classification and regression handle badly. None of them are exotic. They each exist because practitioners kept bumping into the same walls and got tired of pretending the walls weren't there.
We'll walk through each one using a running example: building a lending platform that needs to make decisions about loan applications. It turns out that a single domain like lending naturally hits almost every one of these "nice to know" topics. That's not a coincidence — real systems rarely fit neatly into textbook categories.
Generalized Linear Models — When Your Target Misbehaves
Our lending platform needs to predict how many times a borrower will miss a payment over the next year. The target is a count: 0, 1, 2, 3, and so on. Counts can't be negative. They're often skewed — most people miss zero or one payment, a few miss many. Linear regression doesn't know any of this. It models the target as a real number on a bell curve, and it will happily predict −0.7 missed payments for a good borrower. That answer is nonsensical.
A Generalized Linear Model fixes this by doing two things differently. First, it lets you specify the distribution your target actually follows. For counts, that's usually a Poisson distribution — a discrete distribution over the non-negative integers, right-skewed when the mean is small. For strictly positive continuous values like dollar amounts, you'd use a Gamma distribution. For binary outcomes, the Binomial distribution — and when you pair that with a logit link, you get logistic regression. Logistic regression has been a GLM all along. That fact genuinely surprised me when I first learned it.
The second piece is the link function. It's the mathematical bridge between the linear combination of features (which can be any real number) and the mean of the target distribution (which might need to be positive, or between 0 and 1). For Poisson regression, the link function is the natural log. The model computes a linear combination of features, and then exponentiates it to produce a prediction that's always positive. For logistic regression, the link is the logit function: the model's linear output lives on the real line, and the inverse link (the sigmoid) squashes it into the 0-1 interval. The link function is doing the hard work of respecting the target's constraints while letting the model stay linear on the inside.
Think of GLMs as linear regression wearing different lenses. The core engine — a linear combination of features — stays the same. But the lens (link function) and the film (distribution) change to match what you're actually modeling. Insurance claim counts? Poisson with log link. Claim dollar amounts? Gamma with log link. Whether someone defaults at all? Binomial with logit link. The "generalized" part is the realization that all of these are the same machine with different settings.
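Here's what those "different settings" look like in practice. A minimal sketch using scikit-learn's PoissonRegressor, which pairs the Poisson family with a log link; the borrower features and coefficients below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor, GammaRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))  # hypothetical borrower features (standardized)

# Simulate missed-payment counts whose log-mean is linear in the features
y_counts = rng.poisson(lam=np.exp(0.3 * X[:, 0] - 0.8 * X[:, 1]))

# Poisson family, log link: exp() of the linear predictor, so never negative
count_model = PoissonRegressor(alpha=1e-3)
count_model.fit(X, y_counts)
print(count_model.predict(X[:5]))  # non-negative expected counts

# The same pattern with a Gamma family (GammaRegressor) suits strictly
# positive continuous targets such as claim dollar amounts.
```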
The limitation worth knowing: GLMs still assume a linear relationship between features and the linked target. If the true relationship is wildly nonlinear, a GLM won't capture it. That's when you start reaching for tree-based models or neural networks — but even those sometimes benefit from being wrapped in a GLM-like output layer that respects the target distribution.
Survival Analysis — Modeling Time When the Clock Hasn't Stopped
Back to our lending platform. We don't only want to know whether a borrower will default. We want to know when. "This borrower has a 15% chance of defaulting within 6 months, but a 40% chance within 3 years" is far more useful than a flat "high risk" label.
The problem is that our data has a hole in it. Some borrowers are still making payments — they haven't defaulted yet, but we can't say they never will. If we throw out these observations, we're biasing the model toward people whose loans already ended, which skews everything. If we label them as "no default," that's a lie — we don't know that. This is called censoring, and it's the central problem that survival analysis was built to solve.
Right-censoring is the most common type: the event (default, death, churn) hasn't happened by the time we stop watching. Imagine photographing a race while it's still running. You know some runners haven't finished yet — you can't mark them as "did not finish." Survival analysis keeps these partial observations in the dataset and extracts information from them. The fact that a borrower has survived 18 months without defaulting is information, even though we don't know the final outcome.
The Kaplan-Meier estimator is the simplest survival tool. It produces a step-function curve showing what fraction of the population has "survived" (not experienced the event) at each point in time. No features, no modeling — it's the survival equivalent of computing a mean. Useful for visualization and comparison between groups, but it can't tell you why some people default faster.
For that, there's the Cox proportional hazards model. It models the hazard rate — the instantaneous risk of the event at time t, given survival up to that point. The clever part: it separates the baseline hazard (how the overall risk changes over time) from the feature effects (how being a higher-income borrower shifts that risk). The model doesn't assume any particular shape for the baseline hazard. It only says that the features multiply the hazard by a constant factor. Higher income might cut your hazard in half. A previous bankruptcy might triple it. These are hazard ratios, and they're directly interpretable.
Machine learning has extended survival analysis with random survival forests and deep learning variants like DeepSurv, which handle nonlinear feature effects that Cox can't capture. Python's lifelines library makes classical survival analysis surprisingly accessible.
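To make that concrete, here is a minimal lifelines sketch on toy loan data; the column names and numbers are invented, but the pattern (durations plus an event indicator, with censored rows kept in) is the real workflow:

```python
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter

# Toy data: months observed, whether default happened (1) or the loan is censored (0)
df = pd.DataFrame({
    "months_on_book": [6, 18, 24, 9, 36, 12, 30, 15],
    "defaulted":      [1,  0,  1, 1,  0,  0,  1,  0],
    "income":         [40, 85, 30, 55, 90, 35, 70, 60],
})

# Kaplan-Meier: the overall survival curve, no features, censored rows included
kmf = KaplanMeierFitter()
kmf.fit(durations=df["months_on_book"], event_observed=df["defaulted"])
print(kmf.survival_function_)

# Cox proportional hazards: feature effects reported as hazard ratios (exp(coef))
cph = CoxPHFitter()
cph.fit(df, duration_col="months_on_book", event_col="defaulted")
cph.print_summary()
```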
The gotcha that trips people up in interviews: you cannot drop censored observations. You also cannot treat them as "no event." Both approaches introduce systematic bias. And if censoring is informative — meaning people drop out of your dataset because they're about to default (say, they refinance elsewhere when they're struggling) — then even standard survival methods break down. That's a deeper problem, and being honest about it is more impressive in an interview than pretending it doesn't exist.
Quantile Regression — When the Average Is a Lie
Our lending platform has built a model predicting expected loss on each loan. The average expected loss across a portfolio is $2,000. That sounds manageable. But what does the distribution look like? If the 95th percentile of loss is $50,000, that's a completely different risk profile than if it's $4,000. The average told us almost nothing about the tail risk that could sink the business.
Standard regression optimizes for the mean. Quantile regression optimizes for any quantile you want — the median (50th percentile), the 90th percentile, the 5th percentile. It does this by swapping out the squared-error loss function for an asymmetric one called the pinball loss (or check loss). For the median, the pinball loss penalizes over-predictions and under-predictions equally, but by their absolute value rather than their square. For the 90th percentile, under-predictions get penalized 9 times more heavily than over-predictions, which forces the model's predictions up toward the high end of the distribution.
Here's the intuition I find most useful: ordinary regression draws a single line through the middle of your data. Quantile regression draws multiple lines at different heights — one through the middle, one near the top, one near the bottom. Together, they sketch the shape of the conditional distribution. Where the lines are far apart, there's high uncertainty. Where they converge, the model is confident. You get prediction intervals for free, and unlike confidence intervals from linear regression, these don't assume normality.
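A sketch of that picture, using scikit-learn's GradientBoostingRegressor with its quantile (pinball) loss on synthetic, deliberately heteroscedastic loss data; fit one model per quantile and compare their predictions at the same input:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(2000, 1))
# Synthetic loan losses: the spread grows with the feature, so the mean hides the tail
y = 200 * X[:, 0] + rng.gamma(shape=2.0, scale=50 * (1 + X[:, 0]))

# One model per quantile, each trained with the pinball loss at that quantile
models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
    for q in (0.05, 0.5, 0.95)
}

x_new = [[8.0]]
for q, m in models.items():
    print(f"q={q:.2f}: predicted loss {m.predict(x_new)[0]:,.0f}")
# Far-apart predictions at the same input mean high conditional uncertainty
```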
Practical uses beyond lending: demand forecasting (you need to stock for the 95th percentile, not the average), delivery time estimates (Uber and Amazon show ranges, not point estimates), and energy grid planning (the 99th percentile of demand is what determines whether you need a new power plant).
One quirk worth knowing: if you fit quantile regressions independently for different quantiles, the predicted lines can cross. Your 90th percentile prediction for a specific input might end up lower than the 50th percentile prediction. This is called quantile crossing, and it's physically nonsensical. Joint quantile models exist to prevent this, but they're more complex. In practice, crossing happens most often in sparse data regions — another reminder that edge cases are where things get interesting.
Bayesian Linear Regression — Uncertainty as a Feature, Not a Bug
Ordinary least squares gives you coefficients. Single numbers. "The effect of income on default probability is −0.03." Is that precise? Could it be −0.01 or −0.08? OLS doesn't tell you, not directly. You can compute confidence intervals, but they rely on assumptions about residual distributions that may not hold.
Bayesian linear regression takes a fundamentally different approach. Instead of producing single-point estimates for coefficients, it produces full probability distributions. The coefficient for income isn't −0.03. It's a distribution centered around −0.03, spread out according to how uncertain the model is. With lots of data, that distribution tightens. With little data, it stays wide. The uncertainty is baked into the answer.
Every prediction then carries natural uncertainty bounds. Not "the expected loss is $2,000" but "the expected loss is $2,000, and we're 95% sure it's between $800 and $4,500." These are credible intervals, the Bayesian counterpart to confidence intervals, and they have a more intuitive interpretation: there's a 95% probability the true value falls in this range, given the data and prior.
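One readily available implementation is scikit-learn's BayesianRidge, which assumes Gaussian priors and Gaussian noise; a minimal sketch on a deliberately small synthetic dataset, where the predictive standard deviation is the payoff:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))                       # small dataset on purpose
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

model = BayesianRidge()
model.fit(X, y)

# Each prediction comes with the standard deviation of its predictive distribution
mean, std = model.predict(X[:3], return_std=True)
for m, s in zip(mean, std):
    print(f"prediction {m:.2f} +/- {1.96 * s:.2f} (rough 95% interval)")
```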
The cost is computation. Bayesian inference is more expensive than OLS, sometimes much more so. For large datasets where you have plenty of signal, the extra uncertainty quantification may not be worth the compute. But for small datasets, high-stakes decisions, or situations where the model operates near the edge of its training distribution — basically, wherever knowing what you don't know matters as much as knowing what you do — Bayesian regression earns its keep.
I'll be honest: I still find the mechanics of prior selection somewhat uncomfortable. Choosing a prior feels like smuggling in assumptions, and the question "but how do you pick the prior?" comes up in every conversation about Bayesian methods. The pragmatic answer is that with enough data, the prior gets overwhelmed by the evidence, and different reasonable priors converge. With very little data, the prior matters — and that's actually a feature, because with very little data you should be bringing in prior knowledge. The prior makes that explicit rather than hiding it.
Probability Calibration — When "80% Confident" Means 60%
Our lending platform has a classifier that scores each application. It says it's 80% confident that a certain borrower will repay. We use that probability to set an interest rate. Months later, we discover that among all the borrowers the model rated at 80%, only 60% actually repaid. The model was systematically overconfident, and we priced our loans too cheaply. That's not a classification error — the model was picking the right class more often than not. It's a calibration error, and it's surprisingly common.
Different classifiers have different calibration tendencies. Logistic regression is usually well-calibrated out of the box — it was designed to output probabilities. Random Forests tend to push predictions toward 0.5, because they average many trees and the averages regress toward the center. Gradient boosting models and neural networks tend toward overconfidence — they learn to produce extreme scores during training. SVMs don't even output probabilities natively; their decision function values need to be converted.
The diagnostic tool is a reliability diagram (or calibration curve). Bin your predictions by their predicted probability, then plot the predicted probability against the actual fraction of positives in each bin. Perfect calibration is a diagonal line. If the curve bows below the diagonal, the model is overconfident. If it bows above, it's underconfident. It takes ten seconds to plot and tells you immediately whether you have a calibration problem.
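Here's roughly what that ten-second plot looks like with scikit-learn's calibration_curve, using a random forest on synthetic data as the stand-in model:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba_val = clf.predict_proba(X_val)[:, 1]

# Bin predictions, then compare predicted probability to observed positive rate per bin
prob_true, prob_pred = calibration_curve(y_val, proba_val, n_bins=10)
plt.plot(prob_pred, prob_true, marker="o", label="random forest")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed fraction of positives")
plt.legend()
plt.show()
```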
Two standard fixes exist. Platt scaling fits a logistic regression on top of the raw classifier scores — it learns parameters A and B such that the calibrated probability is 1/(1 + exp(A·score + B)). It works well when the relationship between raw scores and true probabilities is roughly sigmoid-shaped, and it only has two parameters, so it's hard to overfit. Isotonic regression is more flexible — it fits a non-parametric, non-decreasing step function from scores to probabilities. No assumptions about the shape of the mapping. But it has more degrees of freedom, so it needs more data to avoid overfitting.
In scikit-learn, CalibratedClassifierCV wraps any classifier with either method. Pass method='sigmoid' for Platt scaling or method='isotonic' for isotonic regression. The critical rule: always calibrate on held-out data, never on the training set. Calibrating on training data gives you perfectly calibrated training predictions and worthless test-time calibration.
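Continuing the toy setup from the reliability-diagram sketch above, wrapping the same classifier looks like this; the internal cross-validation is what keeps the calibration data separate from the data each underlying model trains on:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0),
    method="isotonic",   # or "sigmoid" for Platt scaling
    cv=5,                # calibrator is fit on held-out folds, never on training folds
)
calibrated.fit(X_train, y_train)
calibrated_proba = calibrated.predict_proba(X_val)[:, 1]
# Re-plot the reliability diagram with calibrated_proba to check the improvement
```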
Calibration matters most when you're using predicted probabilities for downstream decisions — setting thresholds, pricing, ranking, or as inputs to another model. If you only care about the ranking order (who's most likely to default), calibration doesn't matter. If you care about the actual probability values, it matters enormously.
Multi-label Classification — When Labels Aren't Mutually Exclusive
Standard multiclass classification assumes each sample belongs to exactly one class. Dog or cat. Spam or not spam. Digit 0 through 9. The classes are mutually exclusive — the probabilities sum to one, and softmax enforces this. That's the right setup when you're categorizing into bins.
But our lending platform needs to flag risk factors on each application, and a single application might trigger multiple flags simultaneously: "high debt-to-income ratio" AND "employment gap" AND "recent credit inquiry." These aren't mutually exclusive categories. They're independent tags that can co-occur in any combination. This is multi-label classification, and the moment you reach for softmax, you've already made a mistake.
The core fix is mechanical but important: replace softmax with sigmoid on the output layer, and replace categorical cross-entropy with binary cross-entropy (applied independently to each label). Softmax forces all output probabilities to sum to 1, which means raising one prediction forces others down. Sigmoid treats each output independently — a sample can be 90% likely to have flag A and 85% likely to have flag B simultaneously. That's the whole insight in one sentence: sigmoid for multi-label, softmax for multiclass.
Three classical approaches exist for multi-label problems. Binary relevance trains a separate independent binary classifier for each label. It's the simplest approach and scales well, but it completely ignores correlations between labels. If "high debt-to-income" and "employment gap" tend to co-occur, binary relevance won't learn that. Classifier chains fix this by arranging the labels in a sequence — each classifier gets the input features plus the predictions from all previous classifiers in the chain. This captures dependencies, but the order of the chain matters, and different orders give different results. Ensembles of classifier chains (training multiple chains with random orderings and averaging) mitigate this.
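Both classical approaches have direct scikit-learn implementations; a minimal sketch on synthetic risk flags (the labels here are random placeholders, so don't expect the chain to find real co-occurrence structure):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
Y = (rng.uniform(size=(500, 3)) < 0.3).astype(int)   # three flags, any combination allowed

# Binary relevance: one independent classifier per label
binary_relevance = MultiOutputClassifier(LogisticRegression()).fit(X, Y)

# Classifier chain: each classifier also sees the predictions for earlier labels
chain = ClassifierChain(LogisticRegression(), order="random", random_state=0).fit(X, Y)

print(binary_relevance.predict(X[:2]))   # each row is a multi-hot vector of flags
print(chain.predict(X[:2]))
```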
In deep learning, the sigmoid multi-output approach dominates because neural networks can learn label correlations implicitly through shared hidden layers, without the explicit chaining machinery. The architecture is identical to multiclass — the only changes are the final activation and the loss function.
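In PyTorch terms the change is a couple of lines; a minimal sketch, assuming a small feed-forward network and multi-hot targets:

```python
import torch
import torch.nn as nn

n_features, n_labels = 6, 3

# Same architecture you'd use for multiclass; only the loss and output activation differ
model = nn.Sequential(
    nn.Linear(n_features, 32),
    nn.ReLU(),
    nn.Linear(32, n_labels),          # raw logits, one per label, no softmax
)
loss_fn = nn.BCEWithLogitsLoss()      # applies a per-label sigmoid internally

x = torch.randn(8, n_features)
y = torch.randint(0, 2, (8, n_labels)).float()   # multi-hot targets, labels can co-occur
loss = loss_fn(model(x), y)
loss.backward()
```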
Evaluation is different too. Accuracy in multi-label means every single label must be correct for a sample to count as "right" — that's subset accuracy, and it's brutally strict. Hamming loss counts the fraction of individual labels that are wrong, which is usually more informative. Micro and macro F1 scores generalize from binary classification. The choice matters — pick the wrong metric and you'll optimize for the wrong thing.
Ordinal Regression — When Order Matters but Distances Don't
Our lending platform assigns credit ratings: A, B, C, D, E. These have a clear order — A is the best, E is the worst. But is the gap between A and B the same as between D and E? Almost certainly not. If we treat this as standard classification, the model sees no difference between predicting A when the truth is B (off by one notch) and predicting A when the truth is E (off by four notches). Both are equally "wrong." That's absurd.
If we treat it as regression by mapping A=1, B=2, C=3, D=4, E=5, we've assumed the gaps are equal and the target is continuous. Also wrong, but in a different way — the model might predict 2.7, and now we're rounding to a category, which feels arbitrary.
Ordinal regression threads the needle. The most elegant version — the proportional odds model (also called a cumulative link model) — imagines that there's a hidden continuous score for each borrower. You can think of it as their "true creditworthiness" on an invisible number line. The model learns both the relationship between features and this hidden score, and the thresholds that divide the number line into ordered categories. The thresholds aren't evenly spaced — the model figures out where the natural breaks fall.
Mathematically, it models P(Y ≤ k) = sigmoid(θ_k − wᵀx) for each threshold k. The feature weights w are shared across all thresholds — the features shift the borrower's position on the hidden score, and the thresholds determine which category that position falls into. For five categories, you learn four thresholds. The elegance is that a single set of weights works for all categories, and the ordering is built into the structure.
In practice, scikit-learn doesn't have native ordinal regression (a common interview gotcha — knowing this shows you've actually tried to use it). The mord library in Python fills the gap. In R, polr() from the MASS package is the standard tool. For deep learning, the CORAL (Consistent Rank Logits) approach extends ordinal regression to neural networks by converting ordinal targets into a sequence of binary tasks while sharing weights.
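A minimal mord sketch, assuming the ratings have been coded as ordered integers (the data below is synthetic, generated from a hidden score cut at deliberately uneven thresholds):

```python
import numpy as np
import mord   # pip install mord

rng = np.random.default_rng(3)
X = rng.normal(size=(800, 4))

# A hidden creditworthiness score, cut at unevenly spaced thresholds into A=0 ... E=4
latent = X @ np.array([1.0, -0.5, 0.8, 0.2]) + rng.normal(size=800)
y = np.digitize(latent, bins=[-1.5, -0.3, 0.6, 1.8])

# LogisticAT: an "all thresholds" cumulative-link model with one shared weight vector
model = mord.LogisticAT(alpha=1.0)
model.fit(X, y)
print(model.predict(X[:5]))   # predictions are integer ratings that respect the order
```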
Use ordinal regression for survey responses, star ratings, severity scores, educational levels — any target where being close to the right answer is meaningfully better than being far off, but where the distances between categories aren't meaningful numbers.
Online Learning — When Data Never Stops Arriving
Our lending platform processes thousands of applications per day. We could retrain the model every night on all historical data. But the data keeps growing, and "retrain on everything" becomes increasingly expensive. Worse, the patterns shift — economic conditions change, new fraud patterns emerge, borrower behavior evolves. By the time we retrain on last month's data, last month's patterns may already be stale.
Online learning updates the model incrementally as each new data point (or small batch) arrives. Instead of learning from the entire dataset at once, the model sees one example, adjusts slightly, and moves on. In scikit-learn, SGDClassifier and SGDRegressor support this through partial_fit() — you can feed in new data without reloading the old. The model carries forward what it learned and integrates new evidence on the fly.
The deeper motivation isn't only efficiency. It's concept drift — the phenomenon where the relationship between features and target changes over time. A fraud model trained on 2022 patterns will miss 2024 fraud tactics. Online learning can keep up because the model never stops updating on the newest examples. Many algorithms go further and explicitly forget the past — exponential decay on older data, sliding windows, or adaptive learning rates that increase when the model detects its predictions are getting worse.
The tricky part is evaluation. Standard k-fold cross-validation assumes data points are independent and identically distributed, which streaming data violates (tomorrow's data depends on today's economy). Prequential validation (also called progressive validation) is the online learning equivalent: for each new data point, first make a prediction, then reveal the true label and use the error to update the model. Your test error is computed from the predictions you made before seeing the answers. This is honest evaluation for non-stationary data.
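A minimal sketch of that predict-then-update loop with SGDClassifier and partial_fit, on a synthetic stream where the decision boundary drifts slowly:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(4)
model = SGDClassifier()
classes = np.array([0, 1])

mistakes, n_stream = 0, 5000
for t in range(n_stream):
    x_t = rng.normal(size=(1, 5))
    # The true boundary shifts over the stream: concept drift in miniature
    y_t = np.array([int(x_t[0, 0] > 0.5 * t / n_stream)])

    if t > 0:                                              # can't predict before the first update
        mistakes += int(model.predict(x_t)[0] != y_t[0])   # predict first
    model.partial_fit(x_t, y_t, classes=classes)           # then reveal the label and update

print("prequential error rate:", mistakes / (n_stream - 1))
```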
The formal framework for online learning introduces the concept of regret — the cumulative difference between the model's loss and the loss of the best fixed model in hindsight. A good online learning algorithm has sublinear regret, meaning it converges toward the best fixed model over time even though it never retrains from scratch. If an interviewer asks about online learning and you mention regret, you've demonstrated that you understand the theoretical foundation, not only the partial_fit() API.
The Interview Landmines
These "nice to know" topics have a disproportionate number of gotchas that come up in interviews with experienced practitioners. Here are the ones I've seen trip people up most often.
GLMs: "Is logistic regression a GLM?" Yes. It's the binomial family with a logit link. Many people who use logistic regression daily don't realize it's a special case of a broader framework. The follow-up: "What's the link function for Poisson regression?" It's the natural log — because exponentiating the linear predictor guarantees a positive count.
Survival analysis: "Can you drop censored observations from your dataset?" No. This introduces survivorship bias. Censored observations contain real information — the fact that someone survived up to time t constrains the survival function. Dropping them biases your model toward shorter survival times. This is the single most common conceptual error in survival analysis.
Calibration: "Your random forest says 80% probability of default. Should you trust that number?" Not without checking calibration first. Random forests are notoriously poorly calibrated — they push predictions toward 0.5 because they average across trees. Gradient boosting tends toward overconfidence. Logistic regression is usually well-calibrated. The follow-up question: "How would you fix it?" Platt scaling or isotonic regression, always on held-out data.
Multi-label: "Why not use softmax for multi-label classification?" Because softmax forces probabilities to sum to 1, enforcing mutual exclusivity. For multi-label problems you need sigmoid — each label gets an independent probability. Using softmax for multi-label is like being told you can only pick one topping on a pizza that allows unlimited toppings. The architecture is wrong at a fundamental level.
Quantile regression: "Can the 90th percentile prediction be lower than the 50th percentile?" Yes — this is quantile crossing, and it happens when quantiles are fit independently. It's physically impossible but mathematically allowed. Knowing this problem exists, and that joint quantile models address it, shows depth.
Online learning: "How do you evaluate an online learning model?" Not with k-fold cross-validation — that assumes i.i.d. data. Prequential validation: predict first, then update. This is one of those answers where knowing the right evaluation strategy matters more than knowing the model itself.
What You Should Now Be Able To Do
- Recognize when your target distribution calls for a GLM and identify which family and link function to use — and know that logistic regression has been a GLM all along
- Explain why censored observations can't be dropped in survival analysis and describe how the Cox model separates baseline hazard from feature effects
- Reach for quantile regression when the spread of outcomes matters more than the center, and know about the quantile crossing problem
- Distinguish Bayesian regression's distributional outputs from point estimates, and articulate when the computational cost is worth the uncertainty quantification
- Diagnose calibration problems with a reliability diagram and fix them with Platt scaling or isotonic regression — on held-out data
- Use sigmoid (not softmax) for multi-label classification and choose the right evaluation metric for the problem
- Apply ordinal regression when your categories have natural order but unequal spacing, using the proportional odds model or CORAL for deep learning
- Implement online learning with prequential validation and explain the concept of regret