Linear Models
I avoided a deep dive into linear models for longer than I'd like to admit. Every time someone mentioned "linear regression" in an interview prep session or a design review, I'd nod and think, "Yeah, I know that one. Straight line, least squares, move on." I treated them the way you treat the hometown you grew up in — too familiar to be interesting, not worth a second look. Finally, the discomfort of realizing I couldn't answer even moderately probing questions, like why regularization produces sparsity or what the normal equation is actually doing geometrically, grew too great to ignore. Here is that dive.
Linear models are the oldest workhorses in machine learning. Legendre and Gauss argued over who invented least squares around 1805. Two centuries later, linear and logistic regression remain the first models most teams reach for in production — not out of laziness, but because they're interpretable, fast, and surprisingly effective. Everything that came after — neural networks, SVMs, even tree ensembles — borrows ideas (loss functions, regularization, gradient-based optimization) that were first worked out in the linear setting.
Before we start, a heads-up. We're going to build up from a single number on a number line to matrix equations, geometric intuitions for regularization, the sigmoid function, and eventually a unified framework that ties all linear models together. You don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.
This isn't a short journey, but I hope you'll be glad you came.
What "Linear" Actually Means
One Feature, One Line, One Coffee Shop
The Loss: Why We Square Things
The Normal Equation
When the Matrix Gets Too Big: Gradient Descent
More Features, More Problems: Multicollinearity
Rest Stop
Regularization: The Budget Constraint
The Geometry That Explains Sparsity
The Bayesian Secret Identity
ElasticNet: The Compromise
From Numbers to Yes/No: The Sigmoid
Log-Loss: A New Measure of Wrongness
The Decision Boundary and Odds Ratios
Finding the Weights: Why We Must Iterate
Rest Stop
The Bigger Picture: Generalized Linear Models
Production Realities
Wrap-Up
Resources
What "Linear" Actually Means
This is where almost everyone gets tripped up, and I'll be honest — it got me too for an embarrassingly long time. When someone says "linear model," the natural reaction is to picture a straight line. And that's partly right. But the word "linear" in "linear regression" doesn't mean the relationship between input and output has to be a straight line. It means the model is linear in the parameters — the weights you're trying to learn.
Consider this model for predicting monthly revenue at a coffee shop:
revenue = w₀ + w₁ × foot_traffic + w₂ × foot_traffic²
That foot_traffic² term makes the curve bend. The relationship between foot traffic and revenue is no longer a straight line. Yet this is still a "linear model," because if you squint at the weights — w₀, w₁, w₂ — each one appears exactly once, raised to the first power, not multiplied by another weight. The parameters enter the equation linearly. The features can be as wild as you want — x², log(x), sin(x), x₁ × x₂ — and as long as the weights combine linearly, the entire toolkit of linear regression still applies. Closed-form solutions, the normal equation, all of it.
This distinction matters because it means "linear regression" is far more flexible than a straight line. Polynomial regression, interaction terms, basis expansions — all linear models. The line between "linear" and "nonlinear" models is not where most people draw it.
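Here's a quick sketch of that idea in code (synthetic data, made-up coefficients): a curved fit is still plain linear regression once the squared feature sits in the design matrix.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
traffic = rng.uniform(50, 3000, 100)                        # hypothetical daily foot traffic
revenue = 2000 + 12 * traffic - 0.002 * traffic**2 + rng.normal(0, 500, 100)

# the curve lives in the FEATURES; the model stays linear in the weights
X = np.column_stack([traffic, traffic**2])
fit = LinearRegression().fit(X, revenue)
print(fit.intercept_, fit.coef_)                            # w0, then w1 and w2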
One Feature, One Line, One Coffee Shop
Imagine we're helping a friend decide whether to open a coffee shop on a particular street. We've collected data from 12 existing shops: daily foot traffic past the door, and monthly revenue. We plot them and see an upward trend — noisy, but unmistakable. More people walking by means more money. Our job is to turn that noisy cloud of dots into a formula.
The simplest possible formula is a line:
revenue = w₀ + w₁ × foot_traffic
Two numbers. That's all we need to learn. w₁ tells us how much each additional pedestrian is worth in monthly revenue. w₀ is the baseline — what the shop would earn even with zero foot traffic (rent-free coffee for ghosts, presumably). If w₁ comes out to 15, we can tell our friend: "Each extra person passing your door per day is worth roughly $15 a month in revenue." Try explaining a gradient-boosted ensemble's reasoning with the same clarity.
The interpretability isn't a bonus feature. It's the reason linear models survive in healthcare, finance, and any domain where someone with signing authority needs to understand why the model says what it says.
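A minimal version of the coffee shop fit, with numbers I invented to roughly match the story:
import numpy as np
from sklearn.linear_model import LinearRegression

# twelve made-up shops: daily foot traffic and monthly revenue
traffic = np.array([120, 340, 500, 610, 720, 900, 1100, 1300, 1500, 1800, 2200, 2600], dtype=float)
revenue = 3000 + 15 * traffic + np.random.default_rng(1).normal(0, 2000, 12)

line = LinearRegression().fit(traffic.reshape(-1, 1), revenue)
print(f"w0 = {line.intercept_:.0f}, w1 = {line.coef_[0]:.2f}")   # w1 lands near 15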
The Loss: Why We Square Things
We have 12 coffee shops, 12 data points. An infinite number of lines could pass through that noisy cloud. We need a way to pick the best one, which means we need a single number that measures how wrong a given line is. That number is the loss.
The standard choice is to measure the vertical distance between each actual revenue yᵢ and our predicted revenue ŷᵢ, square each distance, and add them up:
Loss = Σᵢ (yᵢ - ŷᵢ)²
Why square the distances instead of using absolute values? Three reasons, and they're all worth knowing.
First, squaring makes every error positive — overestimating by $100 is penalized the same as underestimating by $100. Absolute values do this too, so this alone isn't decisive. Second, squaring punishes large errors disproportionately. One prediction that's off by $200 contributes 40,000 to the loss; four predictions each off by $100 contribute 40,000 total. The squared loss really doesn't want you to be wildly wrong anywhere. Third — and this is the real reason it won — the squared loss creates a smooth, bowl-shaped surface with a single bottom. Think of a valley with one lowest point. No matter where you start, if you walk downhill you'll reach it. That bowl shape is called convexity, and it means there's exactly one best answer. We'll come back to this bowl analogy several times.
There's a deeper reason too. If you assume the noise in your data follows a bell curve — a Gaussian distribution — then minimizing squared error is mathematically identical to maximizing the likelihood of your data. The technique we call Ordinary Least Squares (OLS) and the technique we call Maximum Likelihood Estimation (MLE) are the same calculation, wearing different hats. Legendre published least squares around 1805; Gauss made the Gaussian connection explicit a few years later. Two centuries of validation later, it's still the default.
You could minimize absolute errors instead — that's called L1 loss or Least Absolute Deviations. It's more robust when you have wild outliers that would drag a squared-error line off course. But it has no closed-form solution and isn't differentiable at zero, making optimization trickier. Squared errors win on convenience for the common case.
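One way to feel this difference: the constant that minimizes squared error is the mean, and the constant that minimizes absolute error is the median. Watch what a single outlier does to each (toy numbers):
import numpy as np

revenues = np.array([10_000, 11_000, 12_000, 13_000, 90_000.0])  # one wildly lucky shop
print(np.mean(revenues))    # 27,200: the squared-error minimizer chases the outlier
print(np.median(revenues))  # 12,000: the absolute-error minimizer barely notices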
The Normal Equation
With one feature, finding the best line is a calculus exercise: take the derivative of the loss with respect to w₀ and w₁, set both to zero, solve the two equations. The answer pops out.
Real problems have dozens, hundreds, or thousands of features. Our coffee shop model might include foot traffic, neighborhood median income, distance to the nearest competitor, average rent per square foot, and so on. With p features, the model becomes:
ŷ = Xw where X is n×(p+1), w is (p+1)×1
X is a matrix. Each row is one coffee shop. Each column is one feature (plus a column of ones for the intercept). The loss in matrix form is (y - Xw)ᵀ(y - Xw). Take the derivative with respect to the whole weight vector, set it to zero, and you get the normal equation:
w = (XᵀX)⁻¹ Xᵀy
One line. Plug in data, get the exact optimal weights. No iteration, no hyperparameters, no learning rate to tune. It's elegant. It's exact.
It also has a catch that bit me in a real project before I understood it. The formula requires inverting the matrix XᵀX. Matrix inversion costs O(p³) — that's p cubed, where p is the number of features. With 100 features, that's a million operations. Fine. With 100,000 features, that's 10¹⁵ operations. Not fine. The matrix also needs to be invertible in the first place. If two features are perfectly correlated — say you included both "temperature in Celsius" and "temperature in Fahrenheit" — XᵀX becomes singular and the inverse doesn't exist. Even near-perfect correlation makes the inversion numerically unstable: tiny changes in data produce wildly different weights.
A practical note: production code should never literally compute (XᵀX)⁻¹. Libraries use QR decomposition or SVD instead, which are numerically far more stable. But the formula is worth knowing because it reveals the structure: the answer depends on XᵀX (how the features relate to each other) and Xᵀy (how each feature relates to the target). That structure shows up everywhere.
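To make that concrete, here's the textbook formula next to the stable solve that libraries actually perform, on synthetic, well-conditioned data:
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # ones column = intercept
w_true = rng.normal(size=p + 1)
y = X @ w_true + rng.normal(scale=0.1, size=n)

w_textbook = np.linalg.inv(X.T @ X) @ X.T @ y                # the formula, verbatim
w_stable, *_ = np.linalg.lstsq(X, y, rcond=None)             # what libraries actually do
print(np.allclose(w_textbook, w_stable))                     # True on well-conditioned data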
When the Matrix Gets Too Big: Gradient Descent
When you have so many features that inverting XᵀX is impractical — or you have so many data points that even forming the matrix blows your memory budget — you need a different approach. Instead of solving for the answer in one shot, you sneak up on it.
Start with random weights. Compute how wrong you are. Figure out which direction reduces the error. Take a small step in that direction. Repeat.
w ← w - α · ∇Loss(w)
∇Loss(w) is the gradient — a vector that points in the direction of steepest increase. We move in the opposite direction, downhill. α is the learning rate, the size of each step. Think of it this way: you're standing somewhere in our bowl-shaped valley, blindfolded, and you can feel the slope of the ground under your feet. You take a step downhill, feel the slope again, step again. Eventually you reach the bottom.
The learning rate α is your first real hyperparameter. Make it too large and you overshoot the bottom of the valley, bouncing back and forth across it, possibly flying off into the distance. Make it too small and you spend an eternity inching forward. The sweet spot is problem-dependent, and finding it is part of the craft.
There are three flavors of gradient descent, distinguished by how much data they look at per step. Batch gradient descent computes the gradient using all n data points — accurate but expensive when n is huge. Stochastic gradient descent (SGD) uses a single randomly chosen data point — fast but noisy; the gradient estimate lurches around like a drunkard's walk that still trends downhill. Mini-batch gradient descent uses a random subset, typically 32 to 512 samples — noisy enough to avoid getting stuck, cheap enough to scale. Mini-batch is what everyone uses in practice.
One thing gradient descent demands that the normal equation doesn't: feature scaling. If foot traffic ranges from 50 to 3,000 and median income ranges from 30,000 to 150,000, the loss surface stretches into a long, narrow valley. The gradient points mostly along the narrow axis, and convergence slows to a crawl. Standardizing each feature to zero mean and unit variance — (x - mean) / std — turns that elongated valley into something closer to a round bowl, and gradient descent converges much faster.
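Here's a bare-bones batch gradient descent loop on made-up coffee shop features, with the standardization step included, since without it this exact loop crawls:
import numpy as np

rng = np.random.default_rng(0)
traffic = rng.uniform(50, 3000, 500)
income = rng.uniform(30_000, 150_000, 500)
y = 2000 + 15 * traffic + 0.05 * income + rng.normal(0, 1000, 500)

# standardize, or the income axis stretches the bowl into a canyon
feats = np.column_stack([traffic, income])
feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)
X = np.column_stack([np.ones(500), feats])

w, alpha = np.zeros(3), 0.1
for _ in range(1000):
    grad = 2 / 500 * X.T @ (X @ w - y)    # gradient of mean squared error
    w -= alpha * grad
print(w)                                   # slopes are in standardized units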
More Features, More Problems: Multicollinearity
Going from one feature to many is conceptually straightforward. More columns in X, more weights to learn, same formula. But it introduces a problem that's invisible until it bites you.
Imagine our coffee shop model includes both "total square footage" and "number of seats." These are heavily correlated — bigger shops have more seats. The model can't decide how to split credit between them. It might assign w = +500 to square footage and w = -300 to seats, or w = +200 to square footage and w = 0 to seats. Both combinations produce similar predictions, but the individual coefficients are meaningless. Worse, add one more data point and the coefficients might flip dramatically. The predictions stay stable, but your interpretation falls apart.
The Variance Inflation Factor (VIF) quantifies how bad this is. For each feature j, regress it against all the other features and compute the R²ⱼ. Then VIF(j) = 1 / (1 − R²ⱼ). A VIF above 10 is a serious problem. Above 5 deserves investigation.
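A sketch of the check using statsmodels (synthetic shops; the sqft-to-seats relationship is invented):
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
sqft = rng.uniform(400, 2000, 100)
seats = sqft / 15 + rng.normal(0, 5, 100)        # nearly determined by sqft
rent = rng.uniform(500, 5000, 100)               # independent of both

X = sm.add_constant(pd.DataFrame({"sqft": sqft, "seats": seats, "rent": rent}))
for i, col in enumerate(X.columns):
    if col != "const":                            # skip the intercept column
        print(col, round(variance_inflation_factor(X.values, i), 1))
# sqft and seats blow past 10; rent sits near 1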
The fixes are straightforward: drop one of the correlated features, combine them into a ratio (square footage per seat), or reach for regularization — which handles collinearity gracefully as a side effect. We're about to build the machinery for that.
Rest Stop
Congratulations on making it this far. You can stop here if you want. You now have a working mental model of linear regression: a line (or hyperplane) fitted by minimizing squared errors, solvable either in one shot via the normal equation or iteratively via gradient descent. You understand the tradeoff between the two, you know why feature scaling matters, and you can spot multicollinearity.
That's enough to build, train, and deploy a basic regression model. The short version of what comes next: regularization prevents your model from memorizing noise, and the sigmoid function extends the same framework to classification problems. There. You're 40% of the way through this section, and you have a solid foundation.
But if the discomfort of not knowing why Lasso produces exact zeros while Ridge doesn't, or how logistic regression and linear regression are actually the same framework in disguise, is nagging at you — read on.
Regularization: The Budget Constraint
This is the single most important concept in this entire section, and the ideas transfer to every model you'll ever use. Neural networks, SVMs, tree ensembles — they all use variations of what we're about to build.
The problem is this: as you add features, your model gets more flexible. Training error goes down. But at some point, the model starts fitting the noise in the training data — the random quirks that won't appear in new data. Test error goes back up. The coefficients grow large, canceling each other in elaborate ways that explain the training data perfectly and generalize nowhere.
Here's the analogy that made it click for me. Imagine you're on a shopping trip with a fixed budget. Without a budget, you'd buy everything in the store — including things you don't need. With a budget, you're forced to prioritize: spend on the items that matter most, skip the rest. Regularization is the budget. It constrains how much total "weight" the model can distribute across features, forcing it to put weight only where it truly helps.
Mathematically, we add a penalty to the loss:
Total Loss = Data Loss + λ × Complexity Penalty
λ (lambda) controls how tight the budget is. Large λ means a tight budget — aggressive shrinkage, potentially underfitting. Small λ means a loose budget — barely constrained, risk of overfitting. You tune it with cross-validation.
Ridge Regression (L2 Penalty)
The L2 penalty charges the sum of squared weights:
Loss_ridge = Σ(yᵢ - ŷᵢ)² + λ Σ wⱼ²
Every weight gets taxed in proportion to its square. Large weights are expensive; small weights are cheap. The effect: all coefficients shrink toward zero, but none reach exactly zero. Every feature stays in the model, but with a dampened voice. Going back to our shopping analogy — Ridge is like having a budget that makes expensive items cost even more, so you buy smaller amounts of everything instead of cutting anything completely.
Ridge still has a closed-form solution, and this is where something beautiful happens:
w_ridge = (XᵀX + λI)⁻¹ Xᵀy
Compare this to the original normal equation: w = (XᵀX)⁻¹ Xᵀy. The only difference is that λI — lambda times the identity matrix. Adding λ to the diagonal of XᵀX makes it always invertible, even when features are perfectly correlated. Regularization serves a double purpose: it prevents overfitting and fixes numerical instability. That's not a coincidence. The instability and the overfitting are two symptoms of the same disease — the model has too much freedom for the data it was given.
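A small demonstration with the Celsius/Fahrenheit trap from earlier. Synthetic data; note that real library code typically doesn't regularize the intercept, which this sketch does for brevity:
import numpy as np

rng = np.random.default_rng(0)
celsius = rng.uniform(-10, 35, 50)
fahrenheit = celsius * 9 / 5 + 32                          # exact duplicate information
X = np.column_stack([np.ones(50), celsius, fahrenheit])    # fahrenheit = 32*ones + 1.8*celsius
y = 3 * celsius + rng.normal(0, 1, 50)

# X.T @ X is singular, so the plain normal equation breaks down...
# ...but adding lambda to the diagonal makes it solvable
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w)                                                   # finite, stable weights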
Lasso Regression (L1 Penalty)
The L1 penalty charges the sum of absolute weights:
Loss_lasso = Σ(yᵢ - ŷᵢ)² + λ Σ |wⱼ|
The dramatic difference: Lasso drives coefficients to exactly zero. Not approximately zero. Not 0.0001. Exactly 0.0. If you have 100 features and 8 of them actually matter, Lasso will zero out the other 92. It performs automatic feature selection as a natural byproduct of optimization.
In our shopping analogy, Lasso is like a budget that forces you to choose: buy a few items or nothing at all. No "tiny amounts of everything" — you either commit to an item or leave it on the shelf.
The tradeoff: there's no closed-form solution. You can't write down a one-line formula like Ridge. Lasso is solved by coordinate descent — optimize one weight at a time while holding all others fixed, cycle through all weights, repeat until convergence. Each step has a closed-form solution for a single weight, which makes the algorithm surprisingly fast despite being iterative.
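You can watch the zeroing happen on synthetic data where only a handful of features matter (the alpha here is an arbitrary choice, not a recommendation):
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
w_true = np.zeros(100)
w_true[:8] = rng.uniform(2, 5, 8)                 # only 8 of 100 features matter
y = X @ w_true + rng.normal(0, 1, 200)

lasso = Lasso(alpha=0.1).fit(X, y)
print((lasso.coef_ != 0).sum())                   # a small count, close to 8
print(lasso.coef_[:12])                           # the zeros are exactly 0.0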
But why does the L1 penalty produce exact zeros while L2 doesn't? The answer is geometric, and it took me an embarrassingly long time to internalize it. Let me try to build the picture.
The Geometry That Explains Sparsity
Think of a two-weight model. The weight space is a flat plane with w₁ on one axis and w₂ on the other. The loss function (without regularization) looks like a bowl sitting on that plane, and the bottom of the bowl — the unconstrained optimum — is somewhere off to the side, not at the origin.
Now add the regularization constraint. We're saying the weights must stay within a "budget region" — a shape centered at the origin.
For L2 (Ridge), that budget region is a circle. All points where w₁² + w₂² ≤ t form a disc. For L1 (Lasso), the budget region is a diamond — technically a rotated square. All points where |w₁| + |w₂| ≤ t form a shape with sharp corners sitting exactly on the axes.
The best regularized solution is where the loss function's contour lines — the ovals radiating outward from the bowl's bottom — first touch the constraint region. Picture slowly inflating an oval balloon from the optimum outward until it bumps into the constraint shape.
Here's the key: when an oval bumps into a circle, it can touch anywhere along the smooth surface. The contact point is almost never on an axis, so both weights remain nonzero. But when an oval bumps into a diamond, it's far more likely to hit one of those sharp corners. And the corners sit right on the axes — where one coordinate is exactly zero.
That's the entire explanation. The diamond has corners; the circle doesn't. Corners live on axes; axes are where weights are zero. Lasso produces sparsity because its constraint shape has geometry that funnels solutions toward those corners. In higher dimensions, the diamond becomes a cross-polytope with even more corners (there are 2p of them in p dimensions), and the probability of hitting a corner only increases.
I still find it remarkable that such a fundamental behavioral difference — zero versus nonzero — comes down to whether the constraint shape has sharp corners. No amount of staring at the formulas alone would have given me that intuition. The picture is what makes it stick.
The Bayesian Secret Identity
There's another way to see regularization that's worth carrying around in your head, because it shows up in interviews and connects to a much larger body of ideas.
Ridge regression is equivalent to doing Maximum A Posteriori (MAP) estimation with a Gaussian prior on the weights. In plain language: before seeing any data, you believe each weight is probably close to zero, distributed like a bell curve centered at zero. The data then pulls the weights away from zero, but only as far as the evidence warrants. The penalty λ Σ wⱼ² is the negative log of that Gaussian prior. Large λ means a narrow bell curve — a strong belief that weights should be small. Small λ means a wide bell curve — a weak belief that the data easily overrides.
Lasso regression is MAP estimation with a Laplace prior — a sharper, pointier distribution that puts more probability mass right at zero and has heavier tails. The shape of the Laplace distribution is what encodes the belief that most weights should be exactly zero, but a few can be large. The L1 penalty λ Σ |wⱼ| is the negative log of that Laplace prior.
So choosing between Ridge and Lasso isn't an arbitrary decision about L1 versus L2. It's a statement about what you believe about the world before you look at the data. "I think all features contribute a little" → Ridge (Gaussian prior). "I think most features are irrelevant and a few drive everything" → Lasso (Laplace prior).
I'm still developing my intuition for when this Bayesian framing is practically useful versus when it's more of a philosophical comfort. But it's come up in every senior-level interview I've seen on the topic, so it's worth having in your back pocket.
ElasticNet: The Compromise
Lasso has a flaw. When features are correlated — and in the real world, they almost always are — Lasso behaves erratically. It'll pick one feature from a correlated group and zero out the rest, and which one it picks can change if you add a single data point. Run it on ten different random splits and you get ten different selected features. That's uncomfortable if you care about interpretation.
ElasticNet combines both penalties:
Loss = Σ(yᵢ - ŷᵢ)² + λ[ρ Σ|wⱼ| + (1-ρ) Σwⱼ²]
The parameter ρ (called l1_ratio in sklearn) controls the mix. At ρ = 1, you get pure Lasso. At ρ = 0, pure Ridge. In between, you get the best of both worlds: the L1 component still produces some exact zeros (feature selection), while the L2 component encourages correlated features to share coefficients rather than one randomly winning. Zou and Hastie, who introduced ElasticNet in 2005, called this the grouping effect — correlated features tend to be selected or dropped as a group, not individually.
My practical guideline: if you don't know the correlation structure of your features — and you usually don't at the start — begin with ElasticNet at ρ = 0.5 and tune from there with cross-validation. It's a safer default than either pure Ridge or pure Lasso.
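In sklearn terms, that guideline looks roughly like this (synthetic data; the candidate l1_ratio grid is just an example):
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X[:, 1] = X[:, 0] + rng.normal(0, 0.05, 200)      # a correlated pair
y = 3 * X[:, 0] + rng.normal(0, 1, 200)

enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
print(enet.l1_ratio_, enet.alpha_)                # CV picks both the mix and the budget
print(enet.coef_[:2])                             # the correlated pair often shares weight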
From Numbers to Yes/No: The Sigmoid
Our coffee shop model predicts revenue — a number that can be anything from −∞ to +∞ (well, negative revenue would be unfortunate, but the math allows it). But what if the question changes? Instead of "how much revenue," our friend asks: "Will this shop survive its first year — yes or no?"
We need the output to be a probability between 0 and 1. The fix is disarmingly simple: take the same linear combination wᵀx + b and feed it through a function that squashes any real number into the range (0, 1). Think of it as a dimmer switch. The linear output is how far you've turned the dial — it can be any number. The dimmer switch converts that into a brightness level between off (0) and full (1).
The specific function we use is the sigmoid:
σ(z) = 1 / (1 + e⁻ᶻ)
But here's the thing most tutorials gloss over: the sigmoid isn't an arbitrary choice we picked because it looks nice. It emerges naturally from a specific assumption. Assume that the log-odds of survival are a linear function of the features:
log(p / (1 - p)) = wᵀx + b
The quantity p / (1 - p) is the odds — familiar if you've ever bet on anything. Odds of 3 means "three times more likely to happen than not." Taking the log of the odds gives us a number that ranges from −∞ to +∞, which is exactly the range of a linear function. Now solve for p, and out comes the sigmoid. We didn't choose the shape. The math handed it to us.
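If that "solve for p" step feels like a magic trick, here it is slowed down. Write z for wᵀx + b:
p / (1 − p) = eᶻ                       (exponentiate both sides)
p = eᶻ − p·eᶻ                          (multiply through by 1 − p)
p·(1 + eᶻ) = eᶻ                        (gather the p terms)
p = eᶻ / (1 + eᶻ) = 1 / (1 + e⁻ᶻ)      (divide top and bottom by eᶻ)
That last expression is σ(z).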
The sigmoid's derivative has a beautiful property: σ'(z) = σ(z) · (1 − σ(z)). You compute the derivative directly from the function's own output, with no extra computation. This is why it was so popular in early neural networks — backpropagation was computationally cheap through the sigmoid layer.
At z = 0, the sigmoid outputs exactly 0.5 — maximum uncertainty. At z = 5, it's about 0.993. At z = −5, about 0.007. The interesting transition happens in roughly the [−5, 5] range. Outside that, the function is effectively flat, which will matter when we talk about optimization.
That's logistic regression. Same linear core as before, same weights, same features — but with one extra function call at the end that converts a linear score into a probability. Everything else — the loss, the optimization, the regularization — flows from this one decision.
Log-Loss: A New Measure of Wrongness
In linear regression, we minimized squared error. Can we do the same here? Technically, yes. Practically, it's a disaster. When you push squared error through the sigmoid, the loss surface becomes non-convex — our nice bowl turns into a landscape with multiple valleys and flat plateaus. Gradient descent gets lost. The blindfolded hiker we imagined earlier would wander into a hollow that isn't the lowest point and stay there.
We need a loss function that stays convex and that punishes confident wrong predictions brutally. Here's what works:
L = −(1/n) Σ [yᵢ · log(pᵢ) + (1 − yᵢ) · log(1 − pᵢ)]
To build the intuition, consider three scenarios for a shop that actually survived (y = 1). If we predicted p = 0.99, the loss for that sample is −log(0.99) ≈ 0.01 — confident and right, almost no penalty. If we predicted p = 0.50, the loss is −log(0.50) ≈ 0.69 — uncertain, moderate penalty. If we predicted p = 0.01, the loss is −log(0.01) ≈ 4.61 — confident and wrong, massive penalty.
The logarithm makes the penalty asymptotic. As your predicted probability for the true class approaches zero, the loss shoots toward infinity. Being 99% sure about the wrong answer costs you enormously more than being 60% sure about the wrong answer. That's exactly the behavior you want.
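The three scenarios above, checked in a few lines of numpy:
import numpy as np

def sample_loss(y_true, p):
    # binary cross-entropy for a single prediction
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

for p in (0.99, 0.50, 0.01):
    print(p, round(float(sample_loss(1, p)), 2))
# 0.99 -> 0.01, 0.50 -> 0.69, 0.01 -> 4.61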
This loss goes by two names: binary cross-entropy and negative log-likelihood. Minimizing it is mathematically equivalent to maximum likelihood estimation — finding the weights that make the observed data most probable. Two names, same math, same as we saw with OLS and Gaussian MLE in the linear regression case.
The Decision Boundary and Odds Ratios
Here's something that initially surprised me: the sigmoid is not the decision boundary. The decision boundary is the set of points where wᵀx + b = 0. That's a hyperplane — a line in 2D, a flat plane in 3D, a flat surface in higher dimensions. The sigmoid takes the signed distance from that hyperplane and converts it into a probability. Points far on the positive side get p ≈ 1. Points far on the negative side get p ≈ 0. Points right on the boundary get p = 0.5.
This is why it's called logistic regression, not logistic classification. The underlying model draws a straight line through the feature space. It can't curve around clusters. If your classes aren't linearly separable, logistic regression makes the best straight cut it can — and no better. Add polynomial or interaction features if you need curves.
The real superpower of logistic regression, and the reason it dominates in regulated industries, is how you interpret the weights. Each coefficient has a direct, quantifiable meaning through odds ratios.
A one-unit increase in feature xⱼ multiplies the odds of the positive class by e^(wⱼ). If the weight for "number of nearby competitors" is −0.4, then e^(−0.4) ≈ 0.67 — each additional competitor reduces the survival odds by about 33%. If the weight for "foot traffic (hundreds per day)" is 0.8, then e^(0.8) ≈ 2.23 — each additional hundred daily pedestrians more than doubles the survival odds.
A banker can understand this. A regulator can audit this. A business owner can act on this. Try getting the same precision of explanation from a random forest.
One common trap: confusing odds with probability. "Doubles the odds" does NOT mean "doubles the probability." When the baseline probability is 5%, doubling the odds moves it to about 10%. When the baseline is 40%, doubling the odds moves it to about 57%. The conversion is nonlinear. Always specify the baseline when translating odds ratios into probabilities for stakeholders.
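A tiny converter makes the nonlinearity obvious (a hypothetical helper, not a library function):
def double_the_odds(p):
    # probability -> odds, double them, convert back to probability
    odds = p / (1 - p)
    return 2 * odds / (1 + 2 * odds)

print(double_the_odds(0.05))   # ~0.095: "about 10%", not exactly 0.10
print(double_the_odds(0.40))   # ~0.571: nowhere near 0.80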
Finding the Weights: Why We Must Iterate
Linear regression gave us the normal equation — plug in data, get the answer. Logistic regression has no closed-form solution. The sigmoid makes the gradient equations nonlinear. Set the derivatives to zero and you get a system of equations that can't be solved algebraically. You must iterate.
The workhorse algorithms:
Gradient descent follows the negative gradient of log-loss, stepping downhill each iteration. It scales well to large datasets and is conceptually the simplest approach. For logistic regression's convex loss, any local minimum is the global minimum, so you don't worry about getting stuck in bad optima the way you do with neural networks.
Newton's method uses not only the gradient (which direction is downhill) but also the curvature — the Hessian matrix of second derivatives — to take smarter, larger steps. It converges quadratically, meaning the number of correct digits roughly doubles each iteration. For logistic regression, Newton's method takes the elegant form of Iteratively Reweighted Least Squares (IRLS): at each step, you construct a weighted linear regression problem and solve it. Then you recompute the per-sample weights from the current predictions and solve again. The fact that logistic regression optimization reduces to repeatedly solving weighted least squares problems is one of those connections that makes the mathematical framework feel deeply coherent. Each iteration costs O(p³) though, so it's impractical for large p.
L-BFGS approximates the Hessian from recent gradient information, giving you most of Newton's speed at a fraction of the memory cost. It's sklearn's default solver, and for most problems, it's the right choice.
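For completeness, here's what the gradient descent option looks like for log-loss: a minimal numpy sketch on synthetic data, not a substitute for a real solver.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
w_true = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=500) < sigmoid(X @ w_true)).astype(float)

w = np.zeros(3)
for _ in range(5000):
    p = sigmoid(X @ w)
    w -= 0.5 * (X.T @ (p - y) / 500)   # gradient of mean log-loss
print(w)                                # approaches w_true; convex, so no bad valleys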
Rest Stop
If you've made it here, you can now build and train both regression and classification models from the linear family. You understand the loss functions, the optimization approaches, regularization with its geometric and Bayesian interpretations, and how to interpret coefficients as odds ratios. That's a formidable toolkit.
What follows is the deeper layer: the unified framework that reveals linear and logistic regression as special cases of the same idea, and the production realities that separate a notebook experiment from a deployed model.
But if the discomfort of not knowing what's underneath is nagging at you, read on.
The Bigger Picture: Generalized Linear Models
Here's something that took me a while to appreciate: linear regression and logistic regression aren't two separate models. They're two instances of a single framework called Generalized Linear Models (GLMs), introduced by Nelder and Wedderburn in 1972.
A GLM has three components. The random component specifies the probability distribution of the response variable — Gaussian for continuous data, Bernoulli for binary outcomes, Poisson for counts. The systematic component is the familiar linear predictor η = Xβ — a weighted sum of features. The link function connects the two: it maps the expected value of the response to the linear predictor.
For linear regression, the link function is the identity: g(μ) = μ. The expected value of y equals the linear predictor directly. For logistic regression, the link function is the logit: g(μ) = log(μ / (1 − μ)). The log-odds of the probability equals the linear predictor. For Poisson regression (modeling count data like "number of customer complaints per day"), the link function is the log: g(μ) = log(μ).
Same framework. Same estimation machinery (IRLS works for all GLMs). Different distribution assumption, different link function. Once you internalize this, meeting a new GLM — gamma regression for insurance claims, negative binomial regression for overdispersed counts — isn't learning a new model. It's picking a new row from the table.
The link functions aren't arbitrary either. Each distribution in the exponential family has a canonical link — the link that falls out naturally from the math of the distribution. The identity link for Gaussian, the logit for Bernoulli, the log for Poisson — these are the canonical links. Using them gives the nicest mathematical properties, though you're free to use others if the problem demands it.
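statsmodels exposes this framework directly. A Poisson regression sketch on synthetic counts (the coefficients are invented):
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(300, 2)))
mu = np.exp(X @ np.array([0.3, 0.7, -0.4]))              # log link: linear predictor -> mean
y = rng.poisson(mu)

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # same IRLS engine, new family
print(fit.params)                                         # near [0.3, 0.7, -0.4]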
My favorite thing about the GLM framework is that it took two models I'd been treating as unrelated for years and revealed them as the same engine with different fuel. That realization changed how I think about modeling — from "which model should I use?" to "what distribution does my response follow, and what's the natural link?"
Production Realities
Feature Scaling Is Non-Negotiable for Regularization
If foot traffic ranges from 50 to 3,000 and rent ranges from $500 to $5,000, the regularizer penalizes the foot traffic coefficient unfairly — it's larger only because the units are smaller. Standardize all features before any regularized model. Use sklearn's Pipeline so the scaler is fitted only on training data and applied consistently to test data.
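A minimal version of that setup (synthetic features on deliberately different scales):
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(50, 3000, 100),       # foot traffic
                     rng.uniform(500, 5000, 100)])     # rent
y = 2000 + 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 500, 100)

# the scaler is refit inside every CV fold: no test-fold statistics leak in
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
print(cross_val_score(model, X, y, cv=5).mean())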
The C vs. Alpha Confusion
I'll be honest — I still occasionally get tripped up by sklearn's parameterization. In Ridge and Lasso, the regularization strength is alpha (larger = stronger). In LogisticRegression, it's C — the inverse of regularization strength (larger C = weaker regularization). C = 1/λ. The convention comes from SVMs. Memorize this or you'll tune in the wrong direction and wonder why your model gets worse.
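Side by side, so the direction is impossible to forget:
from sklearn.linear_model import Ridge, LogisticRegression

Ridge(alpha=10.0)            # alpha UP -> stronger regularization
LogisticRegression(C=0.1)    # C DOWN  -> stronger regularization (C = 1/lambda)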
Perfect Separation
If a feature or combination of features perfectly predicts the outcome — every shop with foot traffic above 500 survived, every one below 500 failed — the MLE doesn't exist. The optimizer tries to push the coefficient to infinity to make the sigmoid output exactly 0 or 1. Symptoms: absurdly large coefficients, convergence warnings, NaN values. The fix is regularization (add an L2 penalty, which caps the weights) or Firth's penalized likelihood for small-sample settings. This has bitten me in production with rare-event models more than once.
Calibration: The Hidden Superpower
When logistic regression says "this shop has a 70% chance of surviving," it means that among all shops it scores around 70%, roughly 70% actually survive. That's calibration — predicted probabilities match observed frequencies. Logistic regression gets this essentially for free because it optimizes log-loss, which is a proper scoring rule.
Most other classifiers are not naturally calibrated. Random forests push probabilities toward 0.5. SVMs don't output probabilities at all. Naive Bayes produces extreme probabilities. In production — insurance pricing, medical triage, fraud thresholds — you need the probability to mean something. Check calibration with a reliability diagram (predicted vs. actual positive rate in binned buckets) and the Brier score (lower is better).
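A sketch of both checks, using predictions that are perfectly calibrated by construction so you can see what "good" looks like:
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
p_pred = rng.uniform(0, 1, 10_000)
y_true = (rng.uniform(size=10_000) < p_pred).astype(int)   # calibrated by construction

frac_pos, mean_pred = calibration_curve(y_true, p_pred, n_bins=10)
print(np.round(frac_pos - mean_pred, 2))    # near zero in every bin
print(brier_score_loss(y_true, p_pred))     # lower is better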
Class Imbalance
If 95% of shops in your dataset survived and 5% failed, the model will happily predict "survives" for everything and achieve 95% accuracy. Useless. The clean fix is reweighting the loss function:
LogisticRegression(class_weight='balanced')
This upweights the minority class proportionally. It's almost always better than resampling (SMOTE, etc.) for logistic regression because resampling changes your data distribution while reweighting changes your loss function — same effect, less risk of data leakage.
Multiclass Extension
Binary logistic regression handles two classes. For K classes, softmax regression (multinomial logistic regression) generalizes the sigmoid. Each class gets its own weight vector, and the softmax function converts K raw scores into K probabilities that sum to 1: P(class k) = e^(zₖ) / Σⱼ e^(zⱼ). When K = 2, softmax collapses to sigmoid. The loss becomes categorical cross-entropy. The alternative — One-vs-Rest, training K separate binary classifiers — is simpler but produces probabilities that don't sum to 1. Softmax is generally preferred because it jointly optimizes all decision boundaries.
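A few lines of numpy make the sigmoid connection concrete:
import numpy as np

def softmax(z):
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0])             # one raw score per class
print(softmax(scores).sum())                     # 1.0

# with K = 2 and the second score pinned to 0, softmax collapses to the sigmoid
z = 1.3
print(softmax(np.array([z, 0.0]))[0], 1 / (1 + np.exp(-z)))   # identical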
Residual Diagnostics
OLS assumes linearity, independent errors, normally distributed residuals, and constant variance — the LINE assumptions. When they hold, OLS is the Best Linear Unbiased Estimator (the Gauss-Markov theorem). But "unbiased" isn't always optimal: a little bias from regularization often buys a lot of variance reduction, and your test error drops.
Always plot residuals versus fitted values. Random scatter is good. Curves mean you're missing a nonlinear relationship. Funnels mean the variance isn't constant — use Weighted Least Squares or robust standard errors. Cook's Distance flags data points that disproportionately influence the fit: D > 4/n deserves investigation.
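Both diagnostics are a few lines in statsmodels (synthetic, well-behaved data, so expect few flags):
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.uniform(0, 10, 100))
y = 2 + 3 * X[:, 1] + rng.normal(0, 1, 100)

fit = sm.OLS(y, X).fit()
# plot fit.resid against fit.fittedvalues and look for curves or funnels
cooks_d, _ = fit.get_influence().cooks_distance
print(np.where(cooks_d > 4 / len(y))[0])    # indices that deserve a second look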
Wrap-Up
If you're still with me, thank you. I hope it was worth the trip.
We started with the observation that "linear" means linear in parameters, not in features — which makes the framework far more flexible than a straight line. We built a loss function by squaring errors, discovered the normal equation that solves it exactly, and reached for gradient descent when the matrix got too large. We added regularization — first as a practical fix for overfitting, then through the geometric lens that explains why L1 produces zeros and L2 doesn't, and finally through the Bayesian lens that reveals regularization as a statement of prior belief. We extended the same framework to classification with the sigmoid (which the math chose for us, not the other way around), a new loss function, and odds ratios for interpretation. And we zoomed out to see that linear and logistic regression are two views of the same unified framework — Generalized Linear Models.
My hope is that the next time you're in an interview and someone asks "why does Lasso produce sparse solutions," or the next time you're debugging a production model and the coefficients look insane, you won't reach for a vague half-remembered answer. You'll have a clear mental picture of diamonds and circles, Gaussian and Laplace priors, and the bowl-shaped valley that makes all of this work. In other words, a pretty darn good mental model of what's going on under the hood.
Resources
A curated handful of things I found genuinely useful, not an exhaustive bibliography:
- The Elements of Statistical Learning (Hastie, Tibshirani, Friedman) — The O.G. reference. Chapters 3 and 4 cover everything in this section with mathematical rigor. Free PDF from the authors' website. Dense but insightful.
- An Introduction to Statistical Learning (ISLR) — The gentler sibling of the above. If ESL feels like drinking from a firehose, start here. Also free.
- Zou & Hastie (2005), "Regularization and Variable Selection via the Elastic Net" — The paper that introduced ElasticNet and the grouping effect. Surprisingly readable for a statistics paper.
- Nelder & Wedderburn (1972), "Generalized Linear Models" — The paper that unified linear and logistic regression. A landmark in statistical modeling.
- sklearn's Linear Models documentation — Wildly practical. Every solver option, every parameter, worked examples. Bookmark it.
- Brandon Rohrer's "How Does Linear Regression Work?" — A visual walkthrough that builds the intuition without drowning you in equations. Unforgettable illustrations.