Nice to Know
I put off these topics for an embarrassingly long time. Every time I’d see someone mention “do-calculus” or “information geometry” in a paper, I’d nod sagely, close the tab, and go back to tuning hyperparameters. Bayesian nonparametrics? I told myself I’d get to it “next quarter.” That quarter never came. Finally the discomfort of faking comprehension in too many conversations grew unbearable. Here is that dive.
What follows is a collection of advanced Bayesian and probabilistic concepts that sit at the edges of mainstream ML practice. Some of them — like Bayesian deep learning and conformal prediction — are increasingly showing up in production systems. Others — like information geometry and PAC-Bayes bounds — remain mostly in research papers but shape how we think about learning itself. All of them are things a senior practitioner should at least recognize.
Before we start, a heads-up. We’ll be touching on Hessians, KL divergences, causal graphs, and some measure theory. You don’t need any of it beforehand. We’ll add what we need, one piece at a time.
This isn’t a short journey, but I hope you’ll be glad you came.
Bayesian Deep Learning
MC Dropout
Bayes by Backprop
Laplace Approximation
Expectation Propagation
Causal Inference and Pearl’s Framework
The do-operator and do-calculus
Counterfactuals
Instrumental Variables
Rest Stop
Bayesian Nonparametrics
The Dirichlet Process
Information Geometry
PAC-Bayes Bounds
Conformal Prediction
Wrap-Up
Resources
Bayesian Deep Learning
Here’s the uncomfortable truth about standard neural networks: they give you a number and no indication of whether that number is trustworthy. Ask a classifier trained on cats and dogs to classify a photograph of a toaster, and it will confidently tell you “dog, 92%.” It has no mechanism for saying “I have absolutely no idea what this is.”
Imagine you’re building a diagnostic model for a small rural clinic. The model sees chest X-rays and flags potential tuberculosis cases. It was trained on data from large urban hospitals. When a patient walks in with an unusual presentation the model has never encountered, you don’t want a confident wrong answer. You want the model to raise its hand and say “this one needs a human.” That’s the promise of Bayesian deep learning: neural networks that carry uncertainty through every prediction.
The core idea is seductive. Instead of learning a single fixed value for each weight, learn a distribution over weights. Instead of one prediction, average across all plausible weight configurations. The spread of those predictions tells you how confident the model is.
The core challenge is equally clear. A ResNet-50 has about 25 million weights. Maintaining a full probability distribution over each one is astronomically expensive. Nobody does that in practice. What people do instead is find clever approximations, and three of them have become the workhorses of the field.
MC Dropout
Let’s go back to our clinic. You already trained a standard network with dropout — the technique where you randomly zero out neurons during training to prevent overfitting. Normally, you turn dropout off at test time. MC Dropout says: don’t.
Keep dropout on. Run the same X-ray through the network 50 times. Each forward pass randomly drops different neurons, so you get 50 slightly different predictions. If all 50 agree — “tuberculosis, tuberculosis, tuberculosis” — the model is confident. If they’re scattered — “tuberculosis, pneumonia, normal, tuberculosis, normal” — the model is uncertain. The mean of the 50 predictions is your estimate. The standard deviation is your uncertainty.
Gal and Ghahramani showed in 2016 that this is mathematically equivalent to a specific form of variational inference. That’s a fancy way of saying it’s not a hack — there’s a theoretical justification for why treating dropout-as-uncertainty works. The prediction spread you get approximates what you’d get from a proper Bayesian posterior, at least roughly.
The beauty of MC Dropout is that it’s nearly free. You already have dropout in your network. You already know how to run a forward pass. The only cost is running it multiple times instead of once. For our clinic, that might mean the X-ray takes 2 seconds instead of 0.04 seconds. For a life-or-death diagnosis, that’s a bargain.
The limitation is that the quality of the uncertainty depends heavily on dropout rate and placement. Too little dropout and every forward pass gives the same answer — your uncertainty estimate flatlines. The approximation is also known to be overly confident on some out-of-distribution inputs. It’s a good first step, not the final word.
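In PyTorch terms, the whole trick fits in a dozen lines. Here’s a minimal sketch, with the model, input, and pass count as placeholders; a real implementation would also want to catch Dropout2d-style layers and batch the passes:

```python
import torch

def mc_dropout_predict(model, x, n_passes=50):
    """Run `n_passes` stochastic forward passes with dropout left on."""
    model.eval()
    # Flip only the dropout layers back to training mode so they keep dropping neurons.
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()
    with torch.no_grad():
        preds = torch.stack([model(x).softmax(dim=-1) for _ in range(n_passes)])
    return preds.mean(dim=0), preds.std(dim=0)  # predictive mean and per-class spread
```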
Bayes by Backprop
If MC Dropout is the “bolt it on after training” approach, Bayes by Backprop is the “bake it in from the start” approach. Blundell et al. (2015) proposed replacing every fixed weight with a Gaussian distribution parameterized by a mean and a standard deviation. During each forward pass, you sample a concrete weight from that Gaussian using what’s called the reparameterization trick — a way of expressing the random sampling so that gradients can flow through it during backpropagation.
The loss function changes too. Instead of minimizing prediction error alone, you minimize prediction error plus a KL divergence term that measures how far your learned weight distributions have drifted from some prior (typically a simple Gaussian). This is variational inference applied directly to neural network weights.
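To make the mechanics concrete, here’s a sketch of a single Bayes-by-Backprop linear layer. The softplus parameterization of the standard deviation follows the paper; the prior scale, the initialization, and the omission of biases are my own simplifications:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    def __init__(self, n_in, n_out, prior_sigma=1.0):
        super().__init__()
        self.prior_sigma = prior_sigma
        self.w_mu = nn.Parameter(torch.zeros(n_out, n_in))
        self.w_rho = nn.Parameter(torch.full((n_out, n_in), -5.0))  # softplus(rho) = std dev

    def forward(self, x):
        sigma = F.softplus(self.w_rho)
        # Reparameterization trick: sample weights so gradients flow into mu and rho.
        w = self.w_mu + sigma * torch.randn_like(sigma)
        return x @ w.t()

    def kl(self):
        sigma = F.softplus(self.w_rho)
        # Closed-form KL between N(mu, sigma^2) and the N(0, prior_sigma^2) prior, per weight.
        return (torch.log(self.prior_sigma / sigma)
                + (sigma ** 2 + self.w_mu ** 2) / (2 * self.prior_sigma ** 2) - 0.5).sum()

# Training loss = prediction error + KL term, e.g.:
#   loss = F.cross_entropy(model(x), y) + sum(layer.kl() for layer in bayes_layers) / num_batches
```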
For our clinic model, Bayes by Backprop would give richer, more principled uncertainty estimates than MC Dropout. The tradeoff is real: training takes roughly twice as long (you’re learning two parameters per weight instead of one), the optimization landscape is harder to navigate, and there are more hyperparameters to tune. In practice, most teams reach for MC Dropout first and only move to Bayes by Backprop when MC Dropout’s uncertainty estimates aren’t good enough.
Laplace Approximation
The Laplace approximation is the “I already trained my network, can I please add uncertainty without retraining?” answer. And the answer is yes.
Here’s the idea. You train your network the normal way, arriving at a single set of weights — the MAP (maximum a posteriori) estimate, the peak of the posterior landscape. Now imagine you’re standing at that peak, looking at the curvature of the terrain around you. If the peak is sharp and narrow, small changes in weights change the loss dramatically — the model is very sensitive, and uncertainty should be low in directions where the peak is steep. If the terrain is flat in some direction, the model’s predictions don’t change much when you move that way — the model is uncertain about those weight values.
Mathematically, you fit a Gaussian centered at your trained weights, and the shape of that Gaussian is determined by the Hessian — the matrix of second derivatives of the loss function evaluated at those weights. The Hessian captures exactly that curvature. A large second derivative means a sharp peak (low uncertainty). A small second derivative means a flat peak (high uncertainty). The inverse of the Hessian becomes the covariance matrix of your Gaussian posterior approximation.
For a 25-million-parameter network, computing the full Hessian is out of the question — it would be a 25-million-by-25-million matrix. So people use approximations: a diagonal approximation (pretend each weight is independent — fast but crude), a Kronecker-factored approximation (KFAC — captures some correlations between weights in the same layer), or a low-rank approximation (keep only the most important directions of curvature).
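Here’s a rough sketch of the diagonal flavor. I’m using squared gradients (the empirical Fisher) as a cheap stand-in for the Hessian diagonal, which is a common but crude choice; the function names and the prior precision are mine:

```python
import torch

def fit_diagonal_laplace(model, train_loader, loss_fn, prior_precision=1.0):
    """Approximate the Hessian diagonal at the trained (MAP) weights with squared gradients."""
    curvature = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in train_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for c, p in zip(curvature, model.parameters()):
            c += p.grad.detach() ** 2
    # Sharp directions (large curvature) get small variance; flat directions get large variance.
    return [1.0 / (c + prior_precision) for c in curvature]

def laplace_predict(model, x, variances, n_samples=30):
    """Sample weights from the Gaussian around the MAP estimate and average the predictions."""
    map_weights = [p.detach().clone() for p in model.parameters()]
    preds = []
    with torch.no_grad():
        for _ in range(n_samples):
            for p, mean, var in zip(model.parameters(), map_weights, variances):
                p.copy_(mean + var.sqrt() * torch.randn_like(mean))
            preds.append(model(x).softmax(dim=-1))
        for p, mean in zip(model.parameters(), map_weights):
            p.copy_(mean)  # restore the trained weights
    stacked = torch.stack(preds)
    return stacked.mean(0), stacked.std(0)
```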
The Laplace approach shines in production settings where retraining is expensive. The laplace-torch library for PyTorch makes it almost plug-and-play: train your model, wrap it in a Laplace object, call .fit(), and you have uncertainty estimates. For our clinic, this means we could take any existing X-ray model, bolt on uncertainty, and deploy it without touching the training pipeline.
The downside is that it only works well when the posterior is approximately Gaussian near the peak. If the true posterior is multimodal — several distinct good weight configurations exist — the Laplace approximation misses all but one. Think of it like describing a mountain range by the shape of a single summit. If there’s one dominant peak, great. If there are three peaks of similar height, you’re missing the story.
Expectation Propagation
In earlier chapters we spent time with variational inference — the workhorse of approximate Bayesian inference. VI works by finding a simple distribution q that’s close to the true posterior p. “Close” here means minimizing KL(q || p), which has a particular personality: it’s mode-seeking. When the true posterior has two humps and q can only have one, VI will pick one hump and plant itself there, ignoring the other entirely.
Think of it this way. You’re trying to cover an oddly shaped puddle with a circular tarp. KL(q || p) says: make sure the tarp doesn’t extend beyond the puddle. The result? The tarp shrinks to cover one region perfectly and abandons the rest. Expectation Propagation (EP), introduced by Tom Minka in 2001, flips the divergence: it minimizes KL(p || q), which is mass-covering. Now the instruction is: make sure the puddle doesn’t extend beyond the tarp. The tarp stretches to cover everything, even if it’s a loose fit.
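This isn’t EP itself, but the difference between the two KL directions is easy to see numerically. The sketch below fits a single Gaussian q to a bimodal p both ways; the grid, the mixture, and the optimizer settings are arbitrary choices:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

xs = np.linspace(-10, 10, 2001)
dx = xs[1] - xs[0]
p = 0.5 * norm.pdf(xs, -3, 1) + 0.5 * norm.pdf(xs, 3, 1)  # two well-separated humps

def kl(a, b):
    """KL(a || b) on the grid, skipping points where a is effectively zero."""
    mask = a > 1e-12
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

def best_gaussian(direction):
    """Find the single Gaussian q minimizing the requested KL direction."""
    def objective(params):
        mu, log_sigma = params
        q = norm.pdf(xs, mu, np.exp(log_sigma))
        return kl(q, p) if direction == "q||p" else kl(p, q)
    mu, log_sigma = minimize(objective, x0=[0.5, 0.0]).x  # starting point breaks the symmetry
    return mu, np.exp(log_sigma)

print(best_gaussian("q||p"))  # mode-seeking: hugs one hump, roughly mu = 3, sigma = 1
print(best_gaussian("p||q"))  # mass-covering: straddles both, roughly mu = 0, sigma = 3
```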
EP works by breaking the posterior into factors — one for each data point or likelihood term — and approximating each factor with a simple distribution, then stitching them back together. It iterates, refining each factor’s approximation while keeping the others fixed. This message-passing structure makes EP natural for models that factor into local terms, like Gaussian processes and graphical models.
EP powered classic Gaussian process classification implementations and was central to Microsoft’s Infer.NET framework. It tends to be less popular than VI or MCMC in day-to-day practice, partly because convergence isn’t guaranteed — the iterations can oscillate. But when your posterior has multiple modes and you need the approximation to acknowledge all of them, EP is the tool to reach for.
I’ll be honest — I underestimated EP for years, treating it as a niche curiosity. Then I encountered a GP classification problem where VI’s uncertainty estimates were dangerously overconfident, and EP gave a much more calibrated answer. That was the moment it clicked: the direction of the KL divergence isn’t a mathematical detail. It’s a choice about what kind of mistakes you’re willing to make.
Causal Inference and Pearl’s Framework
Everything in ML up to this point has been about patterns in data. Correlations. Associations. “Patients who take drug X tend to recover faster.” But does drug X cause recovery? Or do healthier patients choose to take it? This is the question that separates statistics from science, and for decades it was considered unanswerable from observational data alone.
Judea Pearl, a computer scientist who received the Turing Award for this work, built a formal framework that makes causal reasoning rigorous. His insight was to represent causal relationships as directed graphs — causal diagrams or DAGs (directed acyclic graphs) — where an arrow from A to B means A directly causes B. These diagrams encode not what the data shows, but what we believe about how the world works.
Pearl organized causal reasoning into three rungs of a ladder. The first rung is association: what patterns exist in the data? This is standard ML. The second rung is intervention: what happens if I do something? This is where experiments live. The third rung is counterfactual: what would have happened if things had been different? This is the realm of individual-level “what-if” questions.
I still occasionally mix up which rung a question belongs to. It helps to have a concrete example running through all three.
The do-operator and do-calculus
Let’s go back to our clinic. We observe that patients who receive a new drug recover at higher rates. In probability notation, P(recovery | drug) is high. But wait — maybe sicker patients are less likely to be prescribed the drug (the doctor reserves it for mild cases). What we really want is P(recovery | do(drug)) — the probability of recovery if we force every patient to take the drug, regardless of their severity.
The do-operator, written do(X = x), represents an intervention: we reach into the system and set X to a particular value, breaking all the arrows that normally point into X. It’s the difference between seeing that the barometer dropped (which predicts rain) and smashing the barometer to a low reading (which does not cause rain).
Do-calculus is a set of three algebraic rules that have since been proved complete — meaning if it’s possible to compute a causal effect from observational data plus a causal diagram, these three rules will get you there. The rules tell you when you can swap, insert, or remove do-operators in probability expressions. Each rule corresponds to a graphical condition on the DAG: whether certain paths are blocked when you remove or add edges.
The practical payoff is enormous. In many real-world scenarios, running a randomized controlled trial is impossible or unethical. Do-calculus lets you determine whether the causal question can be answered from observational data alone, given your assumptions about the causal structure. If the answer is yes, it tells you exactly which adjustment formula to use.
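A tiny simulation makes the gap between conditioning and intervening visible. Every number here is invented, but the structure mirrors the clinic story: severity influences both who gets the drug and who recovers, so the naive comparison overstates the drug’s benefit:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

severity = rng.binomial(1, 0.5, n)             # 1 = severe case (the confounder)
p_drug = np.where(severity == 1, 0.2, 0.8)     # doctors reserve the drug for mild cases
drug = rng.binomial(1, p_drug)
p_recover = 0.5 + 0.2 * drug - 0.3 * severity  # the drug helps; severity hurts more
recovery = rng.binomial(1, p_recover)

# Rung one: plain conditioning, what the data "shows".
print("P(recovery | drug=1) =", recovery[drug == 1].mean())   # roughly 0.64
print("P(recovery | drug=0) =", recovery[drug == 0].mean())   # roughly 0.26

# Rung two: back-door adjustment for severity approximates P(recovery | do(drug=1)).
adjusted = sum(
    recovery[(drug == 1) & (severity == s)].mean() * (severity == s).mean()
    for s in (0, 1)
)
print("P(recovery | do(drug=1)) =", adjusted)                  # roughly 0.55
```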
Counterfactuals
Counterfactuals are Pearl’s third rung, and they’re the most philosophically loaded. “If this specific patient had taken the drug, would they have recovered?” This isn’t a population-level question. It’s about one individual, one alternate timeline.
The mechanics work in three steps. First, abduction: use the observed data about the patient to infer the values of any unobserved variables (like the patient’s latent health state). Second, action: intervene in the model and change the treatment. Third, prediction: run the modified model forward to see what would have happened. It’s running a simulation of reality, then replaying it with one variable changed.
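Here’s the recipe on a toy structural causal model. The equations and the inferred noise value are invented for illustration; the point is the abduction, action, prediction shape:

```python
# Structural equation: recovery is determined by the drug and an unobserved latent health score.
def recovery(drug, health):
    return int(0.3 * drug + health > 0.5)

# Step 1, abduction: we observed this patient (no drug, did not recover), so recovery(0, health) == 0
# tells us health <= 0.5. Suppose the rest of the record pins it down to roughly 0.4.
health = 0.4

# Step 2, action: intervene in the model and set drug = 1.
# Step 3, prediction: replay the same patient, same latent health, under the intervention.
print("Would this patient have recovered with the drug?", recovery(1, health))  # 1 -> yes
```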
Counterfactuals are essential for attribution (“was the drug responsible for recovery?”), fairness (“would this loan have been approved if the applicant had been a different gender?”), and legal reasoning (“would the harm have occurred without the defendant’s action?”). They’re also deeply dependent on the causal model being correct. If your diagram is wrong, your counterfactual is fiction.
Instrumental Variables
Sometimes you can’t run an experiment, and the causal effect you care about is tangled up with confounders you can’t measure. Instrumental variables (IVs) offer a clever workaround: a source of variation in the treatment that the confounders can’t touch.
An instrument Z must satisfy three conditions: it affects the treatment X, it affects the outcome Y only through X, and it’s independent of the unmeasured confounders. The classic example: distance to a hospital as an instrument for whether a patient receives surgery. Patients closer to the hospital are more likely to get surgery (relevance), but distance itself doesn’t directly affect health outcomes except through the surgery decision (exclusion).
Back at our clinic, imagine a regional policy change mandates that certain clinics stock the new drug while others don’t. Whether a patient’s nearest clinic stocks the drug is an instrument: it affects whether they receive the drug but doesn’t directly affect their health. Using this instrument, we can estimate the causal effect of the drug even without a randomized trial.
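A quick simulation shows why this works. The numbers are hypothetical, but the structure is the clinic scenario: an unmeasured confounder biases the naive comparison, while a simple Wald-style IV estimate recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
true_effect = 2.0

confounder = rng.normal(size=n)    # unmeasured health status
stocked = rng.binomial(1, 0.5, n)  # instrument: does the nearest clinic stock the drug?
# Treatment depends on the instrument AND the confounder.
drug = (0.8 * stocked + 0.5 * confounder + rng.normal(size=n) > 0.5).astype(float)
# Outcome depends on the treatment and the confounder, never on the instrument directly.
outcome = true_effect * drug + 1.5 * confounder + rng.normal(size=n)

# Naive regression of outcome on treatment, ignoring the confounder: biased upward.
naive = np.cov(outcome, drug)[0, 1] / np.var(drug)
# Wald estimate: effect of the instrument on the outcome, divided by its effect on treatment.
iv = (outcome[stocked == 1].mean() - outcome[stocked == 0].mean()) / (
     drug[stocked == 1].mean() - drug[stocked == 0].mean())

print(f"naive: {naive:.2f}   IV: {iv:.2f}   truth: {true_effect}")
```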
Finding good instruments is notoriously difficult. The exclusion restriction — that the instrument only affects the outcome through the treatment — is fundamentally untestable. It requires domain knowledge and honest argumentation. I’ve seen researchers build entire careers on a single clever instrumental variable, and I’ve seen published instruments demolished by a convincing argument that the exclusion restriction was violated.
Rest Stop
If you’ve made it this far, take a breath. You can stop here if you want.
You now have a working understanding of three approaches to putting uncertainty into neural networks (MC Dropout, Bayes by Backprop, Laplace approximation), an alternative inference method that handles multimodal posteriors better than VI (expectation propagation), and the formal framework for moving from correlation to causation (Pearl’s causal inference, including do-calculus, counterfactuals, and instrumental variables). That’s a genuinely useful toolkit. If someone mentions any of these in a meeting, you can hold your own.
What comes next ventures further into the theoretical foundations. Bayesian nonparametrics deals with the “I don’t even know how many clusters there are” problem. Information geometry asks what it means to measure distance between probability distributions. PAC-Bayes bounds give theoretical guarantees for why Bayesian methods generalize. And conformal prediction offers distribution-free uncertainty that comes with an actual mathematical warranty.
If that nagging feeling of “but what’s underneath?” has you hooked, read on.
Bayesian Nonparametrics
Every clustering algorithm you’ve used asks you the same annoying question: “How many clusters?” K-means needs K. Gaussian mixture models need K. You try K=3, K=5, K=10, stare at elbow plots, compute silhouette scores, and eventually pick a number that feels right. The whole time, there’s this nagging sense that the data should be telling you how many clusters there are, not the other way around.
Bayesian nonparametrics formalizes that feeling. “Nonparametric” doesn’t mean “no parameters” — it means the number of parameters grows with the data. As you see more data, the model is free to invent new clusters if the data demands it.
The Dirichlet Process
The Dirichlet Process (DP) is the mathematical engine behind this. It’s a distribution over distributions — a prior you place not on a parameter value, but on an entire probability distribution. That sounds abstract, so let’s make it concrete with the Chinese Restaurant Process, the most famous metaphor in Bayesian statistics.
Imagine a restaurant with infinitely many tables. The first customer walks in and sits at table 1. The second customer either joins table 1 (probability proportional to the number of people already there — currently 1) or starts a new table 2 (probability proportional to a concentration parameter α). The third customer faces the same choice: join table 1, join table 2, or start table 3, with probabilities weighted by current occupancy and α.
Two things happen naturally. Popular tables get more popular — a “rich get richer” dynamic. But new tables keep appearing, just at a decreasing rate. The concentration parameter α controls the balance: large α means new tables appear frequently (many small clusters), small α means customers pile onto existing tables (few large clusters). The final number of tables is never fixed in advance. It emerges from the data.
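The process is easy to simulate. A minimal sketch, with arbitrary values for α and the number of customers:

```python
import numpy as np

def chinese_restaurant(n_customers, alpha, seed=0):
    rng = np.random.default_rng(seed)
    tables = []                                   # tables[k] = number of people at table k
    for _ in range(n_customers):
        # Existing tables are weighted by occupancy; a new table is weighted by alpha.
        weights = np.array(tables + [alpha], dtype=float)
        choice = rng.choice(len(weights), p=weights / weights.sum())
        if choice == len(tables):
            tables.append(1)                      # open a new table
        else:
            tables[choice] += 1                   # join an existing one
    return tables

print(chinese_restaurant(1000, alpha=1.0))        # a few big tables, a long tail of small ones
print(len(chinese_restaurant(1000, alpha=10.0)))  # larger alpha, more tables
```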
In practice, a Dirichlet Process Mixture Model (DPMM) uses this as a prior for a mixture of Gaussians (or any other distribution family). The result is a clustering algorithm that infers the number of clusters alongside the cluster parameters. It’s beautiful in theory. In practice, fitting DPMMs is computationally expensive and can be sensitive to the choice of α and the base distribution. For most production clustering tasks, running K-means for K=2 through K=20 and picking the best one is faster and more stable. The DP shines when the number of clusters is scientifically meaningful — how many distinct cell types exist in this tissue sample? how many topics recur in this corpus? — rather than a hyperparameter to optimize.
There’s a sibling called the Indian Buffet Process (IBP) that does for features what the DP does for clusters. Instead of assigning each data point to one cluster, the IBP lets each data point have any number of latent features, and the total number of features grows with the data. It’s used in latent factor models when you don’t know how many underlying factors exist.
Information Geometry
I won’t pretend I have deep intuition for Riemannian manifolds. But I can share the core idea that made information geometry click for me, and it starts with a question: what does it mean for two probability distributions to be “close”?
If you think of each probability distribution as a point, the space of all distributions forms a kind of landscape — a statistical manifold. On a flat landscape, moving one meter north changes the terrain the same way everywhere. But the landscape of probability distributions isn’t flat. Near some distributions, a tiny change in parameters causes a huge shift in predicted probabilities. Near others, a large parameter change barely matters.
The Fisher information metric is the natural way to measure distances on this curved landscape. It’s based on the Fisher information matrix, which captures how sensitive the likelihood is to parameter changes. High Fisher information in some direction means the data is very informative about that parameter — the landscape is steep. Low Fisher information means the data doesn’t care about that parameter — the landscape is flat.
This directly connects to optimization. Standard gradient descent treats the parameter space as flat — every direction gets the same treatment. Natural gradient descent uses the Fisher information metric to adjust step sizes based on the actual curvature of the distribution space. It’s like the difference between navigating with a flat map versus a topographic map. On the flat map, you might take a huge step in a direction that barely changes your predictions, and a tiny step in a direction that matters enormously. The natural gradient corrects for this.
The update rule is elegant: instead of θ′ = θ − η ∇L, you use θ′ = θ − η F⁻¹∇L, where F is the Fisher information matrix. In practice, computing F⁻¹ for a large network is expensive, which is why approximations like K-FAC (Kronecker-Factored Approximate Curvature) exist. The natural gradient is the theoretical ideal that algorithms like Adam are rough approximations of.
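For a model small enough to compute F exactly, the natural-gradient step is only a few lines. This toy logistic regression has deliberately badly scaled features, the kind of landscape where plain gradient descent crawls; the data and step size are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * np.array([10.0, 0.1])  # wildly different feature scales
true_w = np.array([0.2, 5.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_w)))

w = np.zeros(2)
lr = 0.5
for _ in range(100):
    p = 1 / (1 + np.exp(-X @ w))
    grad = X.T @ (p - y) / len(y)                # gradient of the mean negative log-likelihood
    fisher = (X.T * (p * (1 - p))) @ X / len(y)  # Fisher information matrix
    w -= lr * np.linalg.solve(fisher + 1e-6 * np.eye(2), grad)  # natural-gradient step

print("recovered weights:", w)                   # close to true_w despite the bad scaling
```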
Information geometry also gives us a deeper understanding of KL divergence. The KL divergence between two nearby distributions, it turns out, is approximately half the squared distance measured by the Fisher metric. It’s the infinitesimal version of the same idea. The famous asymmetry of KL divergence only shows up at larger separations: up close the two directions agree, and it’s the higher-order terms, shaped by the curvature of the landscape, that pull them apart.
PAC-Bayes Bounds
Here’s a question that has haunted learning theory: why do overparameterized neural networks generalize at all? Classical theory says a model with more parameters than data points should memorize the training set and fail on new data. Yet deep networks routinely have millions of parameters trained on thousands of examples and still generalize. The classical bounds are vacuous — they predict generalization gaps larger than 100%, which is useless.
PAC-Bayes bounds offer some of the tightest non-vacuous generalization guarantees we have for stochastic and Bayesian learners. PAC stands for “Probably Approximately Correct” — the framework asks how likely it is that a learning algorithm produces a hypothesis that is approximately correct on new data.
The key insight is beautiful in its simplicity. You start with a prior distribution P over hypotheses — your belief before seeing data. After training, you have a posterior distribution Q. The PAC-Bayes bound says: the test error of a model drawn from Q is bounded by its training error plus a complexity term that grows with KL(Q || P) and shrinks as the number of training examples grows.
Read that again. The penalty for complexity isn’t the number of parameters. It’s how far your posterior moved from your prior. If you start with a reasonable prior and training doesn’t push your weights too far away, the bound stays tight — regardless of how many parameters you have. This is one of the few theoretical tools that starts to explain why deep learning works.
I’ll be honest — the math can feel impenetrable the first time through. But the intuition is almost common-sense: if you didn’t have to change your mind much to fit the data, you probably haven’t overfit. The more you had to contort your beliefs, the less you should trust the result. PAC-Bayes makes that intuition precise.
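For concreteness, here is one common form of the bound (a McAllester-style statement; constants and log factors vary from paper to paper). With probability at least 1 − δ over the draw of the n training examples, simultaneously for every posterior Q,

$$
\mathbb{E}_{h \sim Q}\big[L(h)\big] \;\le\; \mathbb{E}_{h \sim Q}\big[\hat{L}(h)\big] \;+\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
$$

where L is the true risk and L̂ is the empirical risk. Notice that the parameter count never appears: the only way model complexity enters is through KL(Q ∥ P), and the whole complexity term shrinks as n grows.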
Recent work has produced PAC-Bayes bounds that are actually numerically useful — tight enough to predict test accuracy within a few percentage points for some architectures. They’re also being used for model selection: choose the model whose PAC-Bayes bound is tightest.
Conformal Prediction
Everything we’ve discussed so far requires assumptions. Bayesian methods assume a prior. Parametric methods assume a model family. Even the PAC-Bayes bound assumes you chose the prior before seeing data. Conformal prediction asks: can we get valid uncertainty quantification with essentially no assumptions at all?
The answer, remarkably, is yes.
Here’s the setup. You have any trained model — a random forest, a neural network, a linear regression, whatever. You have a held-out calibration set that wasn’t used for training. For each example in the calibration set, you compute a nonconformity score — a measure of how “strange” the example is relative to the model’s prediction. For regression, this could be the absolute residual |y − ŷ|. For classification, it could be 1 minus the predicted probability of the true class.
Now a new example arrives. You want a 90% prediction interval. You sort the calibration scores and find the threshold that 90% of them fall below (with a small finite-sample correction). For regression: the prediction ± that threshold. For classification: include every class whose nonconformity score falls below the threshold.
The guarantee: if the calibration data and future data come from the same distribution (the exchangeability assumption — weaker than independence), then the true label falls inside the prediction set with at least the coverage probability you specified. Not approximately. Not asymptotically. A guarantee that holds at finite sample sizes.
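Split conformal regression is short enough to write out in full. A minimal sketch; the predict function, the data names, and the max-score fallback for tiny calibration sets are my choices:

```python
import numpy as np

def conformal_interval(predict, X_calib, y_calib, x_new, alpha=0.1):
    """Prediction interval for x_new with coverage at least 1 - alpha, given exchangeability."""
    # Nonconformity score on the held-out calibration set: absolute residual.
    scores = np.abs(y_calib - predict(X_calib))
    # Finite-sample-corrected quantile: the ceil((n + 1) * (1 - alpha))-th smallest score.
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    threshold = np.sort(scores)[min(k, n) - 1]
    y_hat = predict(x_new)
    return y_hat - threshold, y_hat + threshold

# Works with any fitted model, e.g. an sklearn regressor:
#   lo, hi = conformal_interval(model.predict, X_calib, y_calib, x_new, alpha=0.1)
```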
Back to our clinic. You trained a model to predict patient blood pressure. On the calibration set, you compute residuals. For a new patient, conformal prediction says: “I predict 130 mmHg, and I guarantee that 90% of the time, the true value falls in [118, 142].” The guarantee holds regardless of whether the model is a neural network or a linear regression, regardless of whether the data is Gaussian or heavy-tailed, regardless of whether the model is well-specified or completely wrong. The only thing that matters is that the new patient comes from the same distribution as the calibration patients.
That’s an extraordinary property. No other uncertainty quantification method offers this kind of distribution-free, model-agnostic guarantee.
The catch is that conformal prediction tells you nothing about whether the intervals are tight. A terrible model will produce wide intervals (which is honest). A good model will produce narrow ones (which is useful). Conformal prediction guarantees coverage, not informativeness. But in settings where you need a guarantee — regulatory compliance, safety-critical systems, clinical trials — having a mathematically airtight coverage promise is worth a lot.
Conformal prediction has been quietly growing in popularity since about 2020, with libraries like MAPIE (Python) making it accessible. It’s one of those tools that, once you understand it, you start seeing opportunities for it everywhere.
Wrap-Up
If you’re still with me, thank you. I hope it was worth the detour into the deeper corners of Bayesian and probabilistic ML.
We started with neural networks that admit uncertainty — MC Dropout as the quick retrofit, Bayes by Backprop as the principled approach, and Laplace approximation as the post-hoc solution. We saw how expectation propagation covers multi-modal posteriors that VI would miss. We climbed Pearl’s causal ladder from association to intervention to counterfactual, armed with do-calculus and instrumental variables. We let the Dirichlet Process decide how many clusters we need. We walked the curved landscape of information geometry. We found that PAC-Bayes bounds explain why overparameterized networks generalize. And we ended with conformal prediction — uncertainty with a mathematical guarantee.
My hope is that the next time you see one of these terms in a paper or hear it in a meeting, instead of nodding sagely and closing the tab, you’ll have a genuine mental model of what’s going on under the hood. And maybe, when the problem calls for it, you’ll know exactly which tool to reach for.
Resources
- Gal & Ghahramani, “Dropout as a Bayesian Approximation” (2016) — The paper that turned a regularization hack into a legitimate uncertainty tool. Wildly influential.
- Judea Pearl, The Book of Why (2018) — Pearl’s accessible introduction to causal inference for a general audience. If you read one thing on causality, make it this.
- Immer et al., “Improving Predictions of Bayesian Neural Nets via the Laplace Approximation” (2021) — The modern treatment of post-hoc Laplace for deep learning, plus the laplace-torch library.
- Shafer & Vovk, “A Tutorial on Conformal Prediction” (2008) — The original accessible tutorial. Older but unforgettable in its clarity.
- Amari, Information Geometry and Its Applications (2016) — The definitive treatment from the person who defined the field. Dense but rewarding for the mathematically inclined.
- McAllester, “PAC-Bayesian Stochastic Model Selection” (1999) — The O.G. paper on PAC-Bayes bounds. Short and surprisingly readable for a theory paper.