Probability & Statistics
Probability is the language your ML models are actually thinking in — you may not realize it, but every prediction is a distribution and every loss function is a negative log-likelihood wearing a costume. Bayes' theorem tells you how to update beliefs as data arrives, and it runs everything from spam filters to GPT. Distributions are your toolkit: learn which one to reach for and you stop being someone who calls model.fit() and start being someone who understands what happens inside. MLE derives your loss functions. MAP gives you regularization. The CLT explains why SGD works. This section is the probabilistic backbone that holds the rest of the book together.
The Topic I Kept Putting Off
I'll be honest with you. I avoided probability for a long time.
I could train models. I could tune hyperparameters. I could read papers and nod along at the math. But when someone asked me to derive a loss function from first principles, or explain why L2 regularization corresponds to a Gaussian prior, I'd change the subject. I had memorized formulas without understanding where they came from. That worked fine until it didn't — until a colleague asked me why our fraud detector's 99.5% accuracy was meaningless, and I couldn't explain it without hand-waving.
So I sat down and actually learned this material. Not from a textbook cover-to-cover — I tried that and bounced off. I learned it the way practitioners learn things: by running into a wall, backing up, finding the piece of math that explains the wall, and then moving forward again. That's what we're going to do here.
Probability is the mathematics of uncertainty. Statistics is the practice of drawing conclusions from data in the face of that uncertainty. Together, they form the foundation that ML is built on. Every model output is a probability distribution. Every loss function is a likelihood in disguise. Every regularization term is a prior belief about parameters. Every training run is an optimization over a probabilistic objective. You can use ML without knowing any of this — plenty of people do — but you will hit ceilings that you cannot break through without it.
A heads-up on prerequisites: we'll assume comfort with basic Python, NumPy, and the mathematical notation from earlier sections. If you know what a derivative is and what a matrix does, you're ready. Everything else we define as we go.
We're going to thread a single running example through this entire section: building a spam filter. We'll start with three emails and gradually make it smarter. Every concept we introduce earns its place by making our filter better. By the end, you'll see how a few probability rules connect to loss functions, regularization, Bayesian inference, and even the architecture of GPT. Let's go.
What we'll cover:
- The Language of Uncertainty — sample spaces, events, and two philosophies
- Conditional Probability — where the mistakes live
- Bayes' Theorem — the one equation
- Rest Stop — permission to breathe
- Random Variables — turning events into numbers
- The Distribution Toolkit — which one and why
- Expectation, Variance, and Friends
- The Two Theorems That Make ML Work
- MLE and MAP — where loss functions come from
- Bayesian Inference — the full picture
- Hypothesis Testing and A/B Testing
- Descriptive Statistics and Resampling
- Concentration Inequalities
- Gaussian Mixture Models
- Causal Inference
- Wrap-Up
The Language of Uncertainty
Let's start with three emails sitting in an inbox. One is from a friend. One is a newsletter. One is trying to sell discount pharmaceuticals. We want to build a classifier that labels each one as spam or not-spam. To do that, we need a language for talking about uncertain outcomes — and that language is probability.
Sample Space, Events, and Probabilities
The sample space Ω is the set of all things that could happen. For a single email, our sample space is {spam, not-spam}. For three emails, it's every possible combination of labels — eight possibilities in total. An event is any subset of the sample space: "at least one email is spam" is an event, "all three are spam" is another.
Probability assigns a number between 0 and 1 to each event. Three axioms govern everything: probabilities are non-negative, the probability of the entire sample space is 1, and the probability of mutually exclusive events adds up. That's it. Every theorem we'll see — Bayes' theorem, the law of total probability, everything — is a consequence of these three rules. The entire edifice stands on a foundation you can state in one sentence.
If we know that roughly 45% of all email is spam (a real-world estimate), we can assign P(spam) = 0.45 for a random email. But what does that number mean? That depends on who you ask.
Two Philosophies: Frequentist and Bayesian
A frequentist says P(spam) = 0.45 means: if you drew a billion random emails, about 45% would be spam. Probability is a long-run frequency. You don't get to assign probabilities to one-time events — asking "what's the probability this specific email is spam?" doesn't make sense in a strict frequentist worldview because it either is or it isn't.
A Bayesian says P(spam) = 0.45 means: given everything I know right now, my degree of belief that this email is spam is 0.45. Probability is a state of knowledge, not a physical frequency. This lets you reason about one-off events, update beliefs as evidence arrives, and put probability distributions over model parameters.
In ML, the divide matters. MLE is frequentist — it finds parameters that best explain the observed data, full stop. MAP and full Bayesian inference are Bayesian — they fold in prior beliefs and produce distributions over parameters. Most practitioners are pragmatic: they use whichever interpretation produces a better model. I lean Bayesian because it gives me a natural framework for regularization and uncertainty, but I use MLE every day. The math is the same either way.
Back to our spam filter. We have three emails. We've observed some features: whether the email contains the word "free," whether it has an attachment, whether the sender is in our contacts. Now we need to talk about what happens when we have partial information — when we know something about the email and want to update our beliefs. That's conditional probability.
Conditional Probability — Where the Mistakes Live
Suppose we know an email contains the word "free." Does that change our belief about whether it's spam? It should. Emails with "free" are more likely to be spam than emails without it. Conditional probability formalizes this: P(A|B) is the probability of A given that B has happened.
The definition: P(A|B) = P(A ∩ B) / P(B). Read it as: out of all the scenarios where B occurred, what fraction also had A? The denominator P(B) restricts the universe to only the outcomes where B is true. It's a simple ratio, but it's the workhorse of all probabilistic reasoning.
For our spam filter: P(spam | "free") = P(spam AND "free") / P("free"). If 40% of all emails are spam-with-"free" and 50% of all emails contain "free," then P(spam | "free") = 0.40 / 0.50 = 0.80. Knowing the email contains "free" pushed our spam belief from 0.45 to 0.80. That's conditioning in action.
The Trap: P(A|B) ≠ P(B|A)
I'll be honest — I confused P(A|B) with P(B|A) for years. Not in a "I don't know the definitions" way. In a "my gut instinct keeps swapping them" way. And I'm not alone. This confusion has a name: the prosecutor's fallacy, and it has sent innocent people to prison.
P(contains "free" | spam) is high — spammers love the word "free." But P(spam | contains "free") is a different question entirely. Plenty of legitimate emails also use the word "free" — free trials, free shipping, "feel free to reach out." The probability of seeing a feature given a class is the likelihood. The probability of a class given a feature is the posterior. These are not the same thing, and Bayes' theorem is what converts between them.
In ML terms: P(data | model) is the likelihood — easy to compute from your model. P(model | data) is the posterior — what you actually want. Every time you train a model, you're wrestling with this distinction.
Independence and Conditional Independence
Two events A and B are independent if knowing one tells you nothing about the other: P(A|B) = P(A), or equivalently P(A ∩ B) = P(A) · P(B). Flipping two coins gives independent events. Knowing the first landed heads tells you nothing about the second.
ML uses independence assumptions aggressively. Naive Bayes — the algorithm behind many early spam filters — assumes all features are conditionally independent given the class label. That is: P(has "free", has attachment, not in contacts | spam) = P(has "free" | spam) · P(has attachment | spam) · P(not in contacts | spam). This is a wild assumption. The word "free" and having an attachment are absolutely correlated in spam. Yet Naive Bayes works surprisingly well because it gets the classification boundary roughly right even when the probability estimates are garbage.
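Here's a minimal sketch of that factorization in code. The per-feature conditional probabilities below are made up for illustration; in a real filter you'd estimate them from labeled emails by counting.
import numpy as np
# Toy Naive Bayes with assumed (illustrative) per-feature probabilities
p_spam = 0.45
p_feat_given_spam = {'free': 0.60, 'attachment': 0.30, 'unknown_sender': 0.80}
p_feat_given_ham = {'free': 0.10, 'attachment': 0.20, 'unknown_sender': 0.30}
def naive_bayes_spam_score(features):
    # Work in log space to avoid underflow; independence turns products into sums
    log_spam = np.log(p_spam) + sum(np.log(p_feat_given_spam[f]) for f in features)
    log_ham = np.log(1 - p_spam) + sum(np.log(p_feat_given_ham[f]) for f in features)
    # Normalize back to a probability (stable log-sum-exp form)
    m = max(log_spam, log_ham)
    return np.exp(log_spam - m) / (np.exp(log_spam - m) + np.exp(log_ham - m))
print(naive_bayes_spam_score(['free', 'unknown_sender']))  # ≈ 0.93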
Conditional independence is subtler: A and B are conditionally independent given C if P(A, B | C) = P(A|C) · P(B|C). Two symptoms might be correlated in the general population but conditionally independent given the disease. Bayesian networks are built entirely on conditional independence — they factor enormous joint distributions into small, tractable pieces.
The Chain Rule and Autoregressive Models
The chain rule of probability decomposes any joint distribution into a product of conditionals: P(A, B, C) = P(A) · P(B|A) · P(C|A, B). For n variables: P(x₁, x₂, ..., xₙ) = P(x₁) · P(x₂|x₁) · P(x₃|x₁,x₂) · ... · P(xₙ|x₁,...,xₙ₋₁). This isn't an assumption — it's a mathematical identity. It always holds.
This is exactly how GPT works. It factors the probability of a text sequence as: P(token₁) · P(token₂|token₁) · P(token₃|token₁,token₂) · ... Each token is predicted conditioned on all previous tokens. The chain rule guarantees this factorization is exact — no information lost. The model's job is to estimate each conditional well. WaveNet does the same for audio. PixelCNN for images. Autoregressive models are the chain rule turned into neural network architecture.
Our spam filter could use the chain rule too. Instead of assuming features are independent (Naive Bayes), we could model P(has "free", has attachment, not in contacts | spam) = P(has "free" | spam) · P(has attachment | has "free", spam) · P(not in contacts | has "free", has attachment, spam). That's more accurate, but we need exponentially more data to estimate all those conditionals. The Naive Bayes independence assumption is wrong, but it keeps us from drowning in data requirements. Every modeling choice is a tradeoff like this.
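To see the chain rule as computation, here's a sketch that scores a tiny token sequence under a hypothetical bigram model. Each conditional looks only one token back, a Markov simplification that GPT does not make, but the factorization logic is identical.
import numpy as np
# Hypothetical bigram conditionals P(next | current) over a 3-token vocabulary
p_first = {'buy': 0.2, 'cheap': 0.3, 'meds': 0.5}
p_next = {
    'buy':   {'buy': 0.1, 'cheap': 0.6, 'meds': 0.3},
    'cheap': {'buy': 0.2, 'cheap': 0.1, 'meds': 0.7},
    'meds':  {'buy': 0.4, 'cheap': 0.4, 'meds': 0.2},
}
def sequence_log_prob(tokens):
    # Chain rule: log P(t1..tn) = log P(t1) + Σ log P(t_i | t_{i-1})
    logp = np.log(p_first[tokens[0]])
    for prev, cur in zip(tokens, tokens[1:]):
        logp += np.log(p_next[prev][cur])
    return logp
print(sequence_log_prob(['buy', 'cheap', 'meds']))  # log(0.2 * 0.6 * 0.7) ≈ -2.477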
Bayes' Theorem — The One Equation
We've been building to this. We know P(A|B) = P(A ∩ B) / P(B). We also know P(B|A) = P(A ∩ B) / P(A). Both expressions share P(A ∩ B) in the numerator. Set them equal and rearrange:
P(A|B) = P(B|A) · P(A) / P(B)
That's Bayes' theorem. It tells you how to flip a conditional — how to go from P(B|A) to P(A|B). In ML notation, with parameters θ and data D:
P(θ|D) = P(D|θ) · P(θ) / P(D)
| Term | Name | Role |
|---|---|---|
| P(θ\|D) | Posterior | What you believe about θ after seeing data |
| P(D\|θ) | Likelihood | How probable the data is under a specific θ |
| P(θ) | Prior | What you believed about θ before seeing data |
| P(D) | Evidence (marginal likelihood) | Normalizing constant — sum of likelihood × prior over all θ |
The Courtroom Analogy
Think of a courtroom. The prior is "innocent until proven guilty" — your starting assumption before any evidence is presented. The likelihood is the testimony: given that the defendant committed the crime, how probable is this evidence? The posterior is the jury's verdict: given all the evidence, how confident are we in guilt? Bayes' theorem is the rational process of moving from the presumption of innocence to a verdict, weighted by the strength of the evidence.
With little evidence, the prior dominates — the jury should lean toward innocence. With overwhelming evidence, the likelihood dominates — the data overwhelms the prior. This is exactly the behavior we want in ML: be cautious with small datasets, be data-driven with large ones.
I keep coming back to this analogy. Weather forecasting works the same way: your prior is yesterday's forecast, the evidence is today's satellite data, and the posterior is the updated forecast. Every time you check the weather app and see the rain probability change from 30% to 70% after new radar data, you're watching Bayes' theorem in action.
The Medical Test Example (Base Rate Fallacy)
This is the example that made Bayes' theorem click for me — and the result genuinely shocked me the first time I worked through it.
A disease affects 1 in 1,000 people. A test has 99% sensitivity (P(positive|disease) = 0.99) and 95% specificity (P(negative|no disease) = 0.95). You test positive. What's the probability you actually have the disease?
Most people — doctors included, studies have shown — guess 95% or 99%. The real answer is about 2%.
# Bayes' theorem: the medical test that humbles everyone
prevalence = 0.001 # P(disease) — the base rate
sensitivity = 0.99 # P(positive | disease) — true positive rate
specificity = 0.95 # P(negative | no disease)
false_positive_rate = 1 - specificity # P(positive | no disease) = 0.05
# Law of total probability: P(positive)
p_positive = (sensitivity * prevalence) + (false_positive_rate * (1 - prevalence))
# Bayes' theorem: P(disease | positive)
p_disease_given_positive = (sensitivity * prevalence) / p_positive
print(f"P(positive) = {p_positive:.5f}")
print(f"P(disease|positive) = {p_disease_given_positive:.4f}")
# P(disease|positive) ≈ 0.0194 — less than 2%!
# The counting approach makes it vivid:
n = 100_000
sick = int(n * prevalence) # 100 actually sick
healthy = n - sick # 99,900 healthy
true_positives = int(sick * sensitivity) # 99 caught correctly
false_positives = int(healthy * false_positive_rate) # 4,995 false alarms
print(f"\nOut of {n:,} people:")
print(f" Actually sick who test positive: {true_positives}")
print(f" Healthy who test positive: {false_positives}")
print(f" Total positive tests: {true_positives + false_positives}")
print(f" Chance you're actually sick: {true_positives/(true_positives+false_positives):.1%}")
The intuition: the disease is so rare that the 5% false positive rate on the huge healthy population generates far more positive tests than the 99% true positive rate on the tiny sick population. The false alarms swamp the real cases. The base rate (prior prevalence) matters enormously, and ignoring it is called the base rate fallacy.
Now bring it back to our spam filter. If only 0.1% of emails are spam in some corporate inbox, even a filter that catches 99% of spam with a 1% false-positive rate flags roughly ten legitimate emails for every real spam it catches (its precision is about 9%). This is why class imbalance destroys naive accuracy metrics. A model that predicts "not spam" for everything hits 99.9% accuracy. Precision and recall exist to combat exactly this problem — they are Bayesian reasoning applied to classifier evaluation.
The Law of Total Probability
We used it above without naming it. If events B₁, B₂, ..., Bₙ partition the sample space (they're mutually exclusive and exhaustive), then for any event A:
P(A) = P(A|B₁)·P(B₁) + P(A|B₂)·P(B₂) + ... + P(A|Bₙ)·P(Bₙ)
We computed P(positive) by summing over two cases: sick and healthy. That's the law of total probability. In ML, this shows up every time you marginalize: P(y) = Σₓ P(y|x)·P(x). Every time you sum out a latent variable — in EM, in VAEs, in mixture models — you're applying this law.
Rest Stop
Take a breath. We've covered a lot of ground.
If you stopped right here, you'd know: what probability is and the two philosophies behind it. How to compute conditional probabilities and avoid the deadly P(A|B) ≠ P(B|A) swap. Bayes' theorem and why base rates matter. The chain rule that powers autoregressive models like GPT. The law of total probability that drives marginalization in every latent variable model.
That's a genuine foundation. If you need to take a break, take a break. Everything after this builds on what you've already learned, but nothing before this point requires what comes next. You have a complete, self-contained toolkit for reasoning under uncertainty.
What's ahead: we'll turn events into numbers (random variables), meet the family of distributions that every model assumes, see where loss functions and regularization really come from, and learn to test whether our model improvements are real or noise. The second half is where the ML connections get dense. Ready? Let's keep going.
Random Variables — Turning Events into Numbers
Our spam filter currently thinks in events: "this email is spam." But to do computation — to optimize, to differentiate, to train a model — we need numbers. A random variable is the bridge: it's a function that assigns a number to each outcome. X = 1 if spam, X = 0 if not-spam. Now we can do math.
Discrete vs. Continuous
Discrete random variables take countable values: the number of spam emails in a batch, the number of heads in 10 coin flips. Continuous random variables take uncountably many values in an interval: the spam probability output of a logistic regression model (any number in [0, 1]), or the loss on a single training example (any non-negative real number).
PMF, PDF, CDF
A discrete random variable is described by its probability mass function (PMF): P(X = x) for each possible value x. Values are non-negative and sum to one. Our spam/not-spam variable has a two-point PMF: P(X=1) = 0.45, P(X=0) = 0.55.
A continuous random variable is described by its probability density function (PDF): f(x). Here's the thing that tripped me up early on — the PDF is not a probability. It can be greater than 1. Probabilities come from integrating the PDF over an interval: P(a < X < b) = ∫f(x)dx from a to b. The probability of any single exact value is zero for a continuous variable, which feels weird but is mathematically necessary.
The cumulative distribution function (CDF) works for both types: F(x) = P(X ≤ x). It starts at 0, ends at 1, and never decreases. The CDF is underappreciated — it directly tells you what fraction of data falls below a threshold, powers percentile calculations, and enables random sampling via the inverse CDF method.
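As a quick sketch of that inverse CDF method: for the Exponential(λ), F(x) = 1 - e^(-λx), so F⁻¹(u) = -ln(1-u)/λ. Push uniform samples through F⁻¹ and exponential samples come out.
import numpy as np
np.random.seed(0)
lam = 1.5
u = np.random.uniform(0, 1, 100_000)
samples = -np.log(1 - u) / lam        # inverse CDF of Exponential(λ)
print(f"mean={samples.mean():.3f} (theory {1/lam:.3f})")
print(f"P(X <= 1) empirical={np.mean(samples <= 1):.3f} "
      f"(CDF value: {1 - np.exp(-lam * 1):.3f})")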
Joint, Marginal, and Conditional Distributions
With two random variables X and Y, their joint distribution P(X, Y) describes every (x, y) pair. The marginal P(X) sums out Y: P(X=x) = Σ_y P(X=x, Y=y). The conditional P(Y|X) tells you Y's distribution given a specific X.
This maps directly to ML's fundamental modeling split. Discriminative models learn P(Y|X) — given features, predict the label. Logistic regression, neural networks, SVMs. Generative models learn the joint P(X, Y), which means modeling how data is created. Naive Bayes, GANs, VAEs, diffusion models. The generative path lets you sample new data and detect outliers, but requires modeling P(X) — the distribution over the entire input space — which is much harder.
The Distribution Toolkit
Every ML model either assumes a distribution or implicitly defines one. The Gaussian noise assumption gives you MSE loss. The Bernoulli assumption gives you cross-entropy. The choice of distribution is a modeling decision with real consequences, and understanding your options is what separates someone who can tune hyperparameters from someone who can design a model from scratch.
Reference: Which Distribution When
| Distribution | Type | Support | Use Case | ML Connection |
|---|---|---|---|---|
| Bernoulli | Discrete | {0, 1} | Single yes/no outcome | Binary classifier output → cross-entropy loss |
| Binomial | Discrete | {0,...,n} | Count of successes in n trials | Correct predictions in a batch |
| Categorical | Discrete | {1,...,K} | One draw from K classes | Softmax output → categorical cross-entropy |
| Multinomial | Discrete | Count vectors | Counts across K categories | Bag-of-words document model |
| Poisson | Discrete | {0,1,2,...} | Event counts in fixed interval | Word counts, rate modeling |
| Geometric | Discrete | {1,2,3,...} | Trials until first success | Samples until rare event |
| Gaussian | Continuous | (-∞, ∞) | Measurements, sums of effects | Weight init, noise model, MSE loss |
| Multivariate Gaussian | Continuous | ℝᵈ | Correlated continuous data | GMMs, VAE latent space, GPs |
| Beta | Continuous | [0, 1] | Probability of a probability | Conjugate prior for Bernoulli/Binomial |
| Dirichlet | Continuous | Simplex | Distribution over distributions | LDA topics, prior for Categorical |
| Exponential | Continuous | [0, ∞) | Time between events | Survival analysis |
| Gamma | Continuous | (0, ∞) | Positive quantities | Prior for Poisson rate, precision |
| Student-t | Continuous | (-∞, ∞) | Gaussian with heavier tails | Robust regression, small-sample testing |
| Chi-squared | Continuous | [0, ∞) | Sum of squared Gaussians | Goodness-of-fit, feature selection |
| Uniform | Continuous | [a, b] | All values equally likely | Uninformative prior, random init |
The Gaussian — Why It's Everywhere
The Gaussian (normal) distribution is parameterized by mean μ and variance σ²: f(x) = (1/√(2πσ²)) · exp(-(x-μ)²/(2σ²)). It's symmetric, bell-shaped, and shows up everywhere. This isn't coincidence — the Central Limit Theorem (which we'll hit soon) guarantees that sums of many independent effects converge to a Gaussian regardless of the original distribution.
ML uses Gaussians constantly. We initialize neural network weights from N(0, σ²) with σ chosen carefully (He initialization uses σ² = 2/n_in). Assuming Gaussian noise on regression targets gives you MSE loss — that's not a design choice, that's a mathematical derivation we'll see in the MLE section. VAEs force the latent space to be Gaussian so you can sample smoothly. Batch normalization pushes activations toward N(0, 1). Gaussian processes put a Gaussian prior over entire functions.
The 68-95-99.7 rule: ~68% of values fall within 1σ of the mean, ~95% within 2σ, ~99.7% within 3σ. Quick mental math for outlier detection: anything beyond 3σ is suspicious. Beyond 5σ is almost certainly wrong data or a different distribution.
The Multivariate Gaussian and Mahalanobis Distance
In d dimensions, the multivariate Gaussian is parameterized by mean vector μ and covariance matrix Σ (d×d, symmetric positive definite). The PDF is: f(x) = (1/√((2π)ᵈ·det(Σ))) · exp(-½(x-μ)ᵀΣ⁻¹(x-μ)).
Σ controls the shape of the probability cloud. Diagonal → dimensions are uncorrelated, cloud is axis-aligned. Off-diagonal entries → correlations, cloud tilts. The eigenvectors of Σ are the principal axes; eigenvalues are the variance along each axis. This is what PCA computes: find the eigenvectors of the sample covariance matrix, project data onto directions of maximum variance.
The exponent (x-μ)ᵀΣ⁻¹(x-μ) is the Mahalanobis distance — a distance metric that accounts for correlations and different scales. Euclidean distance treats all dimensions equally; Mahalanobis stretches and rotates space so the Gaussian becomes a sphere, then measures Euclidean distance in that transformed space. It's the right metric for outlier detection in correlated data.
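Here's a short sketch comparing the two metrics on correlated 2-D data (the covariance values are made up): a point lying along the correlation direction is farther in Euclidean terms yet closer in Mahalanobis terms than a nearer point that violates the correlation.
import numpy as np
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])        # strong positive correlation
Sigma_inv = np.linalg.inv(Sigma)
def mahalanobis(x):
    d = x - mu
    return np.sqrt(d @ Sigma_inv @ d)
a = np.array([2.0, 2.0])    # along the correlation direction
b = np.array([1.5, -1.5])   # against the correlation direction
for name, x in [('along corr ', a), ('against corr', b)]:
    print(f"{name}: euclidean={np.linalg.norm(x - mu):.2f} "
          f"mahalanobis={mahalanobis(x):.2f}")
# 'along corr' is Euclidean-farther but Mahalanobis-closer — the metric
# knows that moving with the correlation is unsurprising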
The Beta Distribution — Modeling Probabilities
The Beta distribution Beta(α, β) lives on [0, 1] — which makes it perfect for modeling a probability itself. What's the probability that our spam filter's true accuracy is 0.85? That question needs a distribution over [0, 1], and the Beta is our tool.
Its shape depends on α and β. When α = β = 1, it's uniform (total ignorance). When α > 1 and β > 1, it's bell-shaped. Think of α as "pseudo-counts of successes" and β as "pseudo-counts of failures." A Beta(10, 2) says "I've seen something like 10 successes and 2 failures": its mean is 10/12 ≈ 0.83 and its peak (the mode) sits at 0.9.
The beautiful part: Beta is the conjugate prior for Bernoulli and Binomial likelihoods. Start with prior Beta(α, β), observe k successes in n trials, and the posterior is Beta(α+k, β+n-k). No integrals. No MCMC. You add counts. That's the entire update. We'll use this for our spam filter shortly.
The Dirichlet — Multivariate Beta
The Dirichlet distribution Dir(α₁,...,αK) generalizes Beta to K dimensions. It produces vectors on the probability simplex (non-negative, sum to one). It's the conjugate prior for Categorical and Multinomial likelihoods. When all αk > 1, it favors uniform-ish distributions. When all αk < 1, it favors sparse distributions. In Latent Dirichlet Allocation (LDA), the Dirichlet prior over topic-word distributions is literally where the model gets its name.
Bernoulli, Binomial, Categorical, Poisson — and Your Loss Functions
Back to our spam filter. Each email is a Bernoulli trial: spam (1) with probability p, not-spam (0) with probability 1-p. The Bernoulli distribution is the simplest: P(X=x) = pˣ(1-p)¹⁻ˣ for x ∈ {0, 1}.
Here's where it gets good. Take the negative log of that: -[y·log(p) + (1-y)·log(1-p)]. Look familiar? That's binary cross-entropy loss. The loss function you use in every binary classifier is the negative log-likelihood of a Bernoulli. It wasn't invented — it was derived. We'll make this derivation explicit in the MLE section.
The Categorical distribution generalizes Bernoulli to K classes. The softmax output of a neural network defines a Categorical, and its negative log-likelihood is categorical cross-entropy loss. The Binomial counts successes in n Bernoulli trials. The Poisson models event counts at a constant rate and appears in NLP for word count models.
The Rest of the Family
The Exponential models time between events — memoryless, used in survival analysis. The Gamma generalizes it and serves as conjugate prior for the Poisson rate. Chi-squared is a Gamma special case — sum of squared standard normals — powering goodness-of-fit tests. The Student-t looks like a Gaussian but with heavier tails; it approaches the Gaussian as degrees of freedom grow and is the go-to for small-sample hypothesis testing. The Uniform is maximum-entropy when you only know the range.
Code: Sampling and Inspecting Distributions
import numpy as np
np.random.seed(42)
n_samples = 10_000
samples = {
'Bernoulli (p=0.3)': np.random.binomial(1, 0.3, n_samples),
'Binomial (n=20, p=0.4)': np.random.binomial(20, 0.4, n_samples),
'Poisson (λ=5)': np.random.poisson(5, n_samples),
'Gaussian (μ=0, σ=1)': np.random.normal(0, 1, n_samples),
'Beta (α=2, β=5)': np.random.beta(2, 5, n_samples),
'Exponential (λ=1.5)': np.random.exponential(1/1.5, n_samples),
}
for name, data in samples.items():
print(f"{name:30s} mean={np.mean(data):.3f} std={np.std(data):.3f} "
f"min={np.min(data):.3f} max={np.max(data):.3f}")
Expectation, Variance, and Friends
Expected Value
The expected value E[X] is the weighted average of all possible values, weighted by their probabilities. Discrete: E[X] = Σ x·P(X=x). Continuous: E[X] = ∫ x·f(x) dx. It's the "center of mass" of the distribution.
Linearity of expectation is one of the most powerful tools in probability: E[aX + bY] = a·E[X] + b·E[Y]. This holds always, even when X and Y are dependent. This is why the expected loss over a mini-batch equals the average of individual expected losses — no independence assumption needed.
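A two-line empirical check that linearity survives dependence, assuming nothing beyond NumPy. Here Y is a deterministic function of X, which is about as dependent as two variables can be:
import numpy as np
np.random.seed(0)
x = np.random.normal(0, 1, 1_000_000)
y = x**2                                  # Y depends on X completely
lhs = np.mean(3*x + 2*y)                  # E[3X + 2Y], estimated directly
rhs = 3*np.mean(x) + 2*np.mean(y)         # 3E[X] + 2E[Y]
print(f"E[3X+2Y]={lhs:.4f}  3E[X]+2E[Y]={rhs:.4f}")  # both ≈ 2.0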
Variance and Standard Deviation
Variance measures spread: Var(X) = E[(X - E[X])²] = E[X²] - (E[X])². Standard deviation σ = √Var(X) lives in the same units as X. Key properties: Var(aX + b) = a²·Var(X) — scaling squares, shifts vanish. For independent variables: Var(X + Y) = Var(X) + Var(Y). This is why averaging n independent samples reduces variance by a factor of n — the variance of the sample mean is σ²/n.
Covariance, Correlation, and the Covariance Matrix
Covariance measures how two variables move together: Cov(X, Y) = E[(X-E[X])(Y-E[Y])] = E[XY] - E[X]·E[Y]. Positive: they rise together. Negative: one rises as the other falls. Zero: no linear relationship — but nonlinear dependence can still exist.
Correlation normalizes to [-1, 1]: ρ(X,Y) = Cov(X,Y)/(σ_X·σ_Y). Scale-invariant measure of linear association.
For a random vector X = [X₁,...,Xd], the covariance matrix Σ has entry (i,j) = Cov(Xᵢ, Xⱼ). It's symmetric and positive semi-definite. Diagonal entries are variances. PCA decomposes this matrix: eigenvectors give principal components, eigenvalues give variance explained. When you "whiten" data by multiplying by Σ^(-½), you make dimensions uncorrelated with unit variance — a common preprocessing step that helps optimization converge faster.
The Two Theorems That Make ML Work
Law of Large Numbers → Monte Carlo
The Law of Large Numbers says the sample mean converges to the true mean as n grows. Take n i.i.d. samples from any distribution with mean μ. As n → ∞, the average (1/n)·Σ Xᵢ → μ with probability 1.
This is the mathematical justification for using averages as estimators. It's also why more training data helps: the empirical risk (average loss on training data) converges to the true risk (expected loss on the data distribution). And it's the foundation of Monte Carlo methods — if you can't compute an integral analytically, draw samples and average. MCMC, importance sampling, and every simulation-based method in ML stand on this theorem.
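A minimal Monte Carlo sketch: estimate E[e^X] for X ~ N(0, 1), which has the known closed form e^(1/2) ≈ 1.649, and watch the sample average converge as n grows.
import numpy as np
np.random.seed(0)
true_value = np.exp(0.5)      # E[e^X] for X ~ N(0, 1) is e^(σ²/2)
for n in [100, 10_000, 1_000_000]:
    x = np.random.normal(0, 1, n)
    estimate = np.mean(np.exp(x))   # sample average → true expectation (LLN)
    print(f"n={n:>9,d} estimate={estimate:.4f} (true={true_value:.4f})")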
Central Limit Theorem → Why SGD Works
The Central Limit Theorem (CLT) is deeper. Take n i.i.d. samples from any distribution with mean μ and finite variance σ². The distribution of the sample mean is approximately Gaussian with mean μ and variance σ²/n, regardless of the original distribution. As n grows, the approximation tightens.
The CLT explains why Gaussians are everywhere in nature — heights, measurement errors, test scores are all sums of many small independent effects. But the ML implication is direct and profound: gradients in mini-batch SGD are averages over batch_size samples. By the CLT, the stochastic gradient is approximately Gaussian around the true gradient. The noise variance decreases as 1/batch_size. Doubling your batch size halves the gradient noise variance — but you get diminishing returns, and some noise is actually beneficial for escaping sharp minima.
import numpy as np
# CLT in action: averages of wildly non-Gaussian data become Gaussian
np.random.seed(42)
n_experiments = 10_000
# Source: heavily skewed Exponential (definitely not Gaussian)
lam = 2.0
true_mean = 1 / lam # 0.5
true_var = 1 / lam**2 # 0.25
for batch_size in [1, 5, 30, 100]:
means = np.array([
np.mean(np.random.exponential(1/lam, batch_size))
for _ in range(n_experiments)
])
clt_predicted_std = np.sqrt(true_var / batch_size)
print(f"batch_size={batch_size:3d} "
f"mean={np.mean(means):.4f} (true={true_mean:.4f}) "
f"std={np.std(means):.4f} (CLT={clt_predicted_std:.4f})")
# As batch_size grows, std shrinks and distribution becomes more Gaussian
MLE and MAP — Where Loss Functions Come From
This is the section where everything connects. Every loss function you've ever used has a probabilistic derivation, and understanding that derivation means you can invent the right loss function for any problem instead of guessing.
Maximum Likelihood Estimation (MLE)
MLE asks: what parameter values make the observed data most probable? Given data D = {x₁,...,xₙ} assumed i.i.d. from a distribution with parameter θ, the likelihood is L(θ) = P(D|θ) = Π P(xᵢ|θ). MLE finds θ* = argmax L(θ).
Products of small numbers underflow, so we always work with the log-likelihood: log L(θ) = Σ log P(xᵢ|θ). Products become sums. Maximizing the log-likelihood is equivalent (log is monotonic) but numerically stable.
Now the punchline. What happens when you write down the negative log-likelihood for specific distributional assumptions?
| Assumption | Negative Log-Likelihood | You Get |
|---|---|---|
| Gaussian noise on targets | Σ (yᵢ - ŷᵢ)² / (2σ²) + const | Mean Squared Error |
| Bernoulli output | -Σ [yᵢ·log(pᵢ) + (1-yᵢ)·log(1-pᵢ)] | Binary Cross-Entropy |
| Categorical output | -Σ log(p_yᵢ) | Categorical Cross-Entropy |
MSE isn't a design choice — it's what you get when you assume your regression targets have Gaussian noise. Cross-entropy isn't a design choice — it's what you get when you assume Bernoulli or Categorical outputs. Every time you call loss = nn.CrossEntropyLoss(), you're doing MLE under a Categorical assumption. This connection is worth internalizing deeply.
Let's see this concretely with our spam filter. Each email has a true label y ∈ {0,1} and our model outputs a spam probability p. The Bernoulli log-likelihood for one email is y·log(p) + (1-y)·log(1-p). Sum over all emails, negate, and you have binary cross-entropy. Minimizing it is the same as maximizing the likelihood that our model generated the observed spam labels. That's it. That's where the loss comes from.
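Here's a tiny numeric check of that claim, assuming nothing beyond NumPy: binary cross-entropy computed the textbook way equals the negative Bernoulli log-likelihood exactly.
import numpy as np
y = np.array([1, 0, 1, 1, 0])             # true labels
p = np.array([0.9, 0.2, 0.6, 0.8, 0.1])   # model's spam probabilities
# Negative log-likelihood of the Bernoulli: -log Π p^y (1-p)^(1-y)
nll = -np.sum(y*np.log(p) + (1-y)*np.log(1-p))
# Binary cross-entropy, written the way loss libraries describe it
bce = -np.sum(np.where(y == 1, np.log(p), np.log(1-p)))
print(f"Bernoulli NLL = {nll:.4f}, BCE = {bce:.4f}")  # identical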
Code: MLE Estimation
import numpy as np
np.random.seed(42)
# MLE for Gaussian: estimates converge as sample size grows
true_mu, true_sigma = 5.0, 2.0
for n in [10, 50, 200, 1000, 10000]:
data = np.random.normal(true_mu, true_sigma, n)
mu_mle = np.mean(data) # MLE for μ = sample mean
sigma2_mle = np.mean((data - mu_mle)**2) # MLE for σ² uses 1/n (biased)
print(f"n={n:5d} μ̂={mu_mle:.4f} (true={true_mu}) "
f"σ̂²={sigma2_mle:.4f} (true={true_sigma**2:.1f})")
# Visualize: log-likelihood peaks at the sample mean
data = np.random.normal(true_mu, true_sigma, 100)
def gaussian_log_lik(mu, data, sigma=true_sigma):
n = len(data)
return -n/2 * np.log(2*np.pi*sigma**2) - np.sum((data-mu)**2)/(2*sigma**2)
mu_grid = np.linspace(2, 8, 200)
ll_values = [gaussian_log_lik(mu, data) for mu in mu_grid]
best_mu = mu_grid[np.argmax(ll_values)]
print(f"\nGrid search peak: μ={best_mu:.3f}, sample mean: {np.mean(data):.3f}")
# They match — the MLE for the Gaussian mean IS the sample mean
MAP: Add a Prior, Get Regularization
MAP maximizes the posterior instead of the likelihood: θ_MAP = argmax P(θ|D) = argmax [P(D|θ)·P(θ)]. Taking the log: argmax [log P(D|θ) + log P(θ)]. The first term is the log-likelihood — data fit. The second is the log-prior — regularization.
This is where the famous connections come from:
| Prior on Weights | Log-Prior Penalty | Regularization |
|---|---|---|
| Gaussian: P(w) ∝ exp(-w²/(2σ²)) | -Σ wᵢ² / (2σ²) | L2 (weight decay) |
| Laplace: P(w) ∝ exp(-\|w\|/b) | -Σ \|wᵢ\| / b | L1 (sparsity) |
A Gaussian prior says "weights are probably small" and penalizes large values quadratically — that's L2 regularization. A Laplace prior penalizes linearly, driving many weights to exactly zero — that's L1 regularization and why LASSO gives sparse solutions. Every time you add weight_decay=0.01 to your optimizer, you're making a Bayesian statement: "I believe the weights have a Gaussian prior with this variance." The regularization strength λ is inversely proportional to the prior variance σ².
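Here's a sketch of MAP-as-regularization for one-parameter linear regression. The closed-form ridge solution appears with λ = σ²_noise / σ²_prior, which makes the Gaussian-prior connection concrete; the data and prior variances below are illustrative.
import numpy as np
np.random.seed(0)
n = 20
x = np.random.uniform(-1, 1, n)
y = 2.5 * x + np.random.normal(0, 0.5, n)   # true slope 2.5, noise σ = 0.5
sigma_noise2 = 0.5**2
for sigma_prior2 in [0.01, 1.0, 100.0]:     # prior variance on the weight
    lam = sigma_noise2 / sigma_prior2       # equivalent L2 strength
    w_map = np.sum(x*y) / (np.sum(x**2) + lam)   # closed-form ridge solution
    print(f"prior var={sigma_prior2:6.2f} -> λ={lam:7.4f}, ŵ={w_map:.3f}")
# A tight prior drags the weight toward 0; a weak prior recovers plain MLE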
MLE vs. MAP vs. Full Bayesian
Three approaches, one ladder of how much of the posterior you keep:
| Approach | What it returns | Uses a prior? | Uncertainty over θ? |
|---|---|---|---|
| MLE | Point estimate: argmax P(D\|θ) | No | No |
| MAP | Point estimate: argmax P(D\|θ)·P(θ) | Yes | No |
| Full Bayesian | The entire posterior P(θ\|D) | Yes | Yes |
MLE and MAP collapse the posterior to a single point; full Bayesian inference keeps the whole distribution and integrates over it when predicting, which is where the posterior predictive in the next section comes from. With enough data the three converge: the likelihood swamps the prior and the posterior concentrates around the MLE.
Fisher Information and Cramér-Rao (Brief)
The Fisher information I(θ) = E[(∂/∂θ log P(X|θ))²] measures how much information data carries about θ. High Fisher information means the log-likelihood has sharp curvature — data is very informative. The Cramér-Rao bound says no unbiased estimator can have variance less than 1/I(θ). MLE is asymptotically efficient — it hits this bound for large n. This is one reason MLE dominates: it's provably optimal in the large-sample limit.
Bayesian Inference — The Full Picture
The Workflow: Prior → Likelihood → Posterior
The Bayesian workflow is always the same three steps. (1) Choose a prior P(θ) encoding beliefs before data. (2) Choose a likelihood P(D|θ) encoding how data is generated. (3) Apply Bayes' theorem to get the posterior P(θ|D). If new data arrives, the current posterior becomes the next prior. This sequential updating is one of the great strengths of the Bayesian framework.
For our spam filter: we believe the spam rate is around 50% but we're not sure. That's a Beta(2, 2) prior. We observe 100 emails, 45 are spam. Our posterior is Beta(2+45, 2+55) = Beta(47, 57). New posterior mean: 47/104 ≈ 0.452. As more data arrives, our uncertainty shrinks and the estimate converges toward the true rate. That's Bayesian learning.
Conjugate Priors
A prior is conjugate to a likelihood if the posterior has the same form as the prior. These are the analytically tractable cases — no approximation needed.
| Likelihood | Conjugate Prior | Posterior | Update Rule |
|---|---|---|---|
| Bernoulli / Binomial | Beta(α, β) | Beta(α+k, β+n-k) | Add successes to α, failures to β |
| Gaussian (known σ²) | Gaussian(μ₀, σ₀²) | Gaussian(μₙ, σₙ²) | Precision-weighted average of prior and data |
| Poisson | Gamma(α, β) | Gamma(α+Σxᵢ, β+n) | Add total count to α, sample size to β |
| Categorical / Multinomial | Dirichlet(α₁,...,αK) | Dirichlet(α₁+c₁,...,αK+cK) | Add observed counts per category |
Posterior Predictive Distribution
Once you have P(θ|D), predict new data by integrating out the parameter: P(x_new|D) = ∫ P(x_new|θ)·P(θ|D) dθ. This is the posterior predictive. Unlike plug-in predictions (use one θ), it accounts for parameter uncertainty — if you're unsure about θ, predictions are wider. This is why Bayesian models give calibrated uncertainty: they know what they don't know.
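For the Beta-Bernoulli case the integral has a closed form: P(next = 1 | D) = α′/(α′ + β′), the posterior mean. A sketch using the posterior from the spam-rate example above:
import numpy as np
from scipy.stats import beta as beta_dist
alpha_post, beta_post = 47, 57   # posterior from the Beta(2, 2) + 100-email example
# Posterior predictive: integrating p over Beta(47, 57) gives the posterior mean
p_spam_next = alpha_post / (alpha_post + beta_post)
print(f"P(next email is spam | D) = {p_spam_next:.4f}")
# Monte Carlo version of the same integral, to make it concrete
samples = beta_dist.rvs(alpha_post, beta_post, size=100_000, random_state=0)
print(f"Monte Carlo estimate      = {samples.mean():.4f}")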
Code: Beta-Binomial Bayesian Updating
import numpy as np
from scipy.stats import beta as beta_dist
# Bayesian spam filter: estimating the true spam rate
# True rate (unknown to us): 0.45
true_spam_rate = 0.45
# Prior: Beta(2, 2) — we mildly believe spam rate is ~50%
alpha, beta_param = 2.0, 2.0
print(f"Prior: Beta({alpha:.0f}, {beta_param:.0f}), mean = {alpha/(alpha+beta_param):.3f}\n")
np.random.seed(42)
batch_sizes = [5, 10, 20, 50, 100, 500]
total_seen = 0
for batch in batch_sizes:
emails = np.random.binomial(1, true_spam_rate, batch)
spam_count = emails.sum()
not_spam = batch - spam_count
total_seen += batch
alpha += spam_count
beta_param += not_spam
mean = alpha / (alpha + beta_param)
ci_low, ci_high = beta_dist.ppf([0.025, 0.975], alpha, beta_param)
print(f"After {total_seen:4d} emails ({spam_count:2d} spam in batch): "
f"Beta({alpha:.0f}, {beta_param:.0f}) "
f"mean={mean:.3f} 95% CI=[{ci_low:.3f}, {ci_high:.3f}]")
print(f"\nTrue rate: {true_spam_rate} | Final estimate: {alpha/(alpha+beta_param):.3f}")
print("The credible interval tightens as evidence accumulates.")
MCMC: When Conjugacy Isn't Available
Most real models don't have conjugate priors. The posterior is some complicated unnormalized density you can't integrate analytically. Markov Chain Monte Carlo (MCMC) solves this: construct a Markov chain whose stationary distribution is the posterior, run it long enough, and the samples you collect approximate the posterior.
The Metropolis-Hastings algorithm is the simplest MCMC. At each step: propose a new parameter value, compute the acceptance ratio (ratio of posterior densities), accept or reject probabilistically. Higher posterior probability? Always accept. Lower? Accept with some probability — this keeps the chain from getting stuck. The accepted samples form your approximate posterior.
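Here's a minimal Metropolis-Hastings sketch targeting the Beta(47, 57) posterior from our spam example. We already know that answer in closed form, which is exactly what makes it a good test case.
import numpy as np
np.random.seed(0)
def log_post(p):
    # Unnormalized log posterior: Beta(47, 57) up to a constant
    if p <= 0 or p >= 1:
        return -np.inf
    return 46 * np.log(p) + 56 * np.log(1 - p)
samples, p = [], 0.5                  # start the chain at 0.5
for _ in range(50_000):
    proposal = p + np.random.normal(0, 0.05)   # symmetric random-walk proposal
    # Accept with probability min(1, posterior ratio)
    if np.log(np.random.uniform()) < log_post(proposal) - log_post(p):
        p = proposal
    samples.append(p)                 # on rejection, the old value repeats
samples = np.array(samples[5_000:])  # discard burn-in
print(f"MCMC mean={samples.mean():.4f} (exact posterior mean {47/104:.4f})")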
Hamiltonian Monte Carlo (HMC) uses gradient information to make smarter proposals — it travels along the posterior surface instead of stumbling randomly. HMC explores high-dimensional posteriors much more efficiently. PyMC and Stan use HMC variants (NUTS — the No-U-Turn Sampler) under the hood.
Variational Inference → ELBO → VAEs
I'm still building intuition for variational inference. It took me three attempts to understand it, and I'll share the version that finally clicked.
Variational Inference (VI) takes a different approach from MCMC: instead of sampling from the posterior, approximate it with a simpler distribution q(θ; φ) and optimize φ to make q as close as possible to the true posterior.
"Close" is measured by KL divergence: KL(q || p). Minimizing this directly requires knowing the true posterior — the thing we're trying to find. The trick: instead, maximize the Evidence Lower Bound (ELBO):
ELBO = E_q[log P(D|θ)] - KL(q(θ) || P(θ))
First term: "the approximate posterior should explain the data." Second term: "the approximate posterior shouldn't stray too far from the prior." Maximizing ELBO is equivalent to minimizing KL divergence to the true posterior.
This is exactly the VAE loss function. The encoder outputs q(z|x) — an approximate posterior over latent variables. The decoder outputs P(x|z) — a likelihood. The loss is: reconstruction loss (negative log-likelihood) + KL divergence between the encoder output and a standard Gaussian prior. That's the negative ELBO. The VAE loss isn't ad hoc — it's variational inference with a specific choice of approximate posterior family.
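The KL term in that loss has a closed form when q is a diagonal Gaussian and the prior is N(0, I): KL = ½ Σ (μ² + σ² − 1 − log σ²). A sketch with made-up encoder outputs for a single input:
import numpy as np
# Hypothetical encoder outputs: mean and log-variance per latent dimension
mu = np.array([0.5, -1.0, 0.1])
log_var = np.array([-0.2, 0.3, 0.0])
# Closed-form KL( N(mu, diag(σ²)) || N(0, I) )
kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)
print(f"KL term = {kl:.4f}")
# Negative ELBO = reconstruction NLL + kl; the reconstruction piece depends
# on the decoder's likelihood (Gaussian -> MSE, Bernoulli -> BCE)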
Hypothesis Testing and A/B Testing
Null and Alternative Hypotheses
You've trained two spam filters. Model B looks 2% better on the test set. Is that real, or did you get lucky with the test split? Hypothesis testing formalizes this question.
The null hypothesis H₀ is the boring explanation: there's no real difference, the 2% is noise. The alternative hypothesis H₁: there is a real difference. You never "prove" H₁ — you either reject H₀ (strong evidence against no-difference) or fail to reject it (not enough evidence).
p-values — What They Are and What They Are NOT
The p-value is the probability of seeing data at least as extreme as what you observed, assuming H₀ is true. Small p-value → "this result would be very unlikely if there were no real effect" → evidence against H₀. Three things a p-value is NOT:
- NOT the probability that H₀ is true. A p-value of 0.03 doesn't mean "3% chance the models are the same."
- NOT the probability your result happened by chance.
- NOT a measure of effect size. A tiny, meaningless difference can produce a tiny p-value with enough data.
Type I/II Errors and Power
| | H₀ True (no real difference) | H₀ False (real difference) |
|---|---|---|
| Reject H₀ | Type I Error (false positive, rate = α) | Correct! (Power = 1 - β) |
| Fail to reject H₀ | Correct! | Type II Error (false negative, rate = β) |
Statistical power (1 - β) is the probability of detecting a real effect when one exists. It depends on effect size, sample size, significance level, and variance. Always do a power analysis before running experiments — underpowered experiments waste resources and produce unreliable conclusions.
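You can do that power analysis by simulation, with nothing beyond the definition of power. This sketch assumes a true effect of 0.3 standard deviations (an illustrative number) and asks how often a two-sample t-test at α = 0.05 detects it at each sample size:
import numpy as np
from scipy import stats
np.random.seed(0)
effect_size = 0.3                       # assumed true difference, in SD units
n_sims = 2_000
for n in [20, 50, 100, 200, 500]:
    rejections = 0
    for _ in range(n_sims):
        a = np.random.normal(0.0, 1.0, n)
        b = np.random.normal(effect_size, 1.0, n)
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:
            rejections += 1
    print(f"n={n:3d} per group -> estimated power = {rejections/n_sims:.2f}")
# The conventional target is 0.8; read off the n where you cross it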
Paired t-Tests for Model Comparison
When comparing two models on the same test set, use a paired t-test. For each sample, compute the performance difference. The paired test accounts for sample difficulty — some samples are hard for both models. It tests whether the differences are systematically non-zero.
Multiple Testing Correction
Run 20 tests at α = 0.05, and you expect one false positive even with no real effects. Bonferroni correction: divide α by the number of tests. Simple but conservative. Benjamini-Hochberg: controls false discovery rate (FDR) instead — less conservative, standard in practice when you expect some true positives among many tests.
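A sketch of both corrections on a batch of hypothetical p-values. Bonferroni rescales the threshold; Benjamini-Hochberg walks the sorted p-values looking for the largest one under its rank-scaled line.
import numpy as np
p_values = np.array([0.001, 0.008, 0.012, 0.041, 0.049, 0.34, 0.62, 0.88])
m, alpha = len(p_values), 0.05
# Bonferroni: compare each p-value to alpha / m
bonf_reject = p_values < alpha / m
# Benjamini-Hochberg: largest k with p_(k) <= (k/m)·alpha; reject p_(1..k)
order = np.argsort(p_values)
sorted_p = p_values[order]
thresholds = (np.arange(1, m + 1) / m) * alpha
below = np.nonzero(sorted_p <= thresholds)[0]
bh_reject = np.zeros(m, dtype=bool)
if below.size > 0:
    bh_reject[order[:below[-1] + 1]] = True
print(f"Bonferroni rejects: {bonf_reject.sum()} of {m}")   # 1
print(f"BH rejects:         {bh_reject.sum()} of {m}")     # 3 — less conservative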
A/B Testing: Frequentist vs. Bayesian
Frequentist A/B testing: fix a sample size, run the experiment, compute a p-value. Binary outcome: significant or not. You cannot peek at results early without inflating error rates. Bayesian A/B testing: compute the posterior probability that variant B is better. You can check results at any time, because you're updating a belief, not running a fixed-sample procedure. The Bayesian approach gives you P(B is better | data) directly — a much more useful quantity than a p-value.
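Here's a sketch of the Bayesian version with made-up conversion counts: put a Beta posterior on each variant's rate, sample from both, and read off P(B better) directly.
import numpy as np
np.random.seed(0)
# Hypothetical results: conversions / visitors for each variant
conv_a, n_a = 120, 1000
conv_b, n_b = 140, 1000
# Beta(1, 1) priors -> posteriors are Beta(1 + conversions, 1 + misses)
post_a = np.random.beta(1 + conv_a, 1 + n_a - conv_a, 100_000)
post_b = np.random.beta(1 + conv_b, 1 + n_b - conv_b, 100_000)
print(f"P(B better than A | data) = {np.mean(post_b > post_a):.3f}")
print(f"Expected lift E[B - A]    = {np.mean(post_b - post_a):.4f}")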
Code: Paired t-Test for Model Comparison
import numpy as np
from scipy import stats
np.random.seed(42)
n_test = 200
# Two spam filters on the same test set
difficulty = np.random.uniform(0, 1, n_test)
model_a_correct = (difficulty + np.random.normal(0, 0.15, n_test) < 0.78).astype(int)
model_b_correct = (difficulty + np.random.normal(0, 0.15, n_test) < 0.82).astype(int)
print(f"Model A accuracy: {model_a_correct.mean():.3f}")
print(f"Model B accuracy: {model_b_correct.mean():.3f}")
print(f"Raw difference: {model_a_correct.mean() - model_b_correct.mean():.3f}\n")
# Paired t-test on per-sample differences
differences = model_a_correct - model_b_correct
t_stat, p_value = stats.ttest_rel(model_a_correct, model_b_correct)
print(f"Paired t-test:")
print(f" Mean difference: {differences.mean():.4f}")
print(f" t-statistic: {t_stat:.3f}")
print(f" p-value: {p_value:.4f}")
if p_value < 0.05:
print(" → Reject H₀: difference is statistically significant at α=0.05")
else:
print(" → Fail to reject H₀: no significant difference detected")
# Effect size
cohens_d = differences.mean() / differences.std()
label = 'small' if abs(cohens_d) < 0.5 else 'medium' if abs(cohens_d) < 0.8 else 'large'
print(f" Cohen's d: {cohens_d:.3f} ({label})")
Descriptive Statistics and Resampling
Summary Statistics
You know these: mean (average), median (middle value, robust to outliers), mode (most frequent). Variance and standard deviation measure spread. Skewness measures asymmetry — income distributions are right-skewed, so the mean is higher than the median. Kurtosis measures tail heaviness — high kurtosis means more outliers than a Gaussian. In practice: always look at the median alongside the mean. If they differ a lot, your distribution is skewed and the mean is misleading.
Bootstrap Confidence Intervals
The bootstrap is one of my favorite tools in all of statistics because it's so direct. Want to know the uncertainty of any statistic — any statistic at all — without making distributional assumptions? Resample your data with replacement many times. Compute the statistic on each resample. Look at the distribution of results. The 2.5th and 97.5th percentiles give you a 95% confidence interval. Works for means, medians, AUC, F1, any crazy metric you can compute.
import numpy as np
np.random.seed(42)
# Bootstrap CI for our spam filter's accuracy
n_test = 500
correct = np.random.binomial(1, 0.85, n_test) # ~85% accuracy
observed_acc = correct.mean()
n_bootstrap = 10_000
boot_accs = np.array([
np.random.choice(correct, size=n_test, replace=True).mean()
for _ in range(n_bootstrap)
])
ci_low, ci_high = np.percentile(boot_accs, [2.5, 97.5])
print(f"Observed accuracy: {observed_acc:.4f}")
print(f"Bootstrap mean: {boot_accs.mean():.4f}")
print(f"Bootstrap std: {boot_accs.std():.4f}")
print(f"95% Bootstrap CI: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"\nWe're 95% confident the true accuracy is between {ci_low:.1%} and {ci_high:.1%}.")
Permutation Tests
Permutation tests assess significance by randomly shuffling labels. Want to know if two models differ? Randomly swap which predictions belong to Model A vs. B many times. If the observed difference lands in the top 5% of the shuffled distribution, reject the null at α = 0.05. Like bootstrap, permutation tests are nonparametric — no distributional assumptions needed.
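A sketch using the paired setup: under the null, each per-sample difference is equally likely to have either sign, so flip signs at random and see how extreme the observed mean difference is. The difference data here is simulated.
import numpy as np
np.random.seed(0)
# Per-sample accuracy differences between two models (illustrative)
differences = np.random.binomial(1, 0.55, 200) - np.random.binomial(1, 0.50, 200)
observed = differences.mean()
n_perms = 10_000
perm_means = np.array([
    np.mean(differences * np.random.choice([-1, 1], size=differences.size))
    for _ in range(n_perms)
])
# Two-sided p-value: fraction of sign-flipped means at least as extreme
p_value = np.mean(np.abs(perm_means) >= abs(observed))
print(f"observed mean diff={observed:.4f}  permutation p={p_value:.4f}")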
Concentration Inequalities
These bounds tell you how far a random variable can stray from its expected value. They're the theoretical backbone of generalization in learning theory — and they show up in ML interviews when the conversation turns to "why does your model generalize?"
Markov's inequality: For non-negative X, P(X ≥ a) ≤ E[X]/a. Extremely loose, but requires almost nothing. Chebyshev's inequality: P(|X - μ| ≥ kσ) ≤ 1/k². Uses variance information for a tighter bound. Still distribution-free.
Hoeffding's inequality is the workhorse. For n independent bounded random variables in [a, b]: P(|X̄ - μ| ≥ t) ≤ 2·exp(-2nt²/(b-a)²). The probability of the sample mean deviating from the true mean drops exponentially with n.
This directly gives PAC (Probably Approximately Correct) learning bounds: with high probability ("probably"), the empirical risk is close to the true risk ("approximately correct"). Hoeffding tells you exactly how many samples you need for a given confidence level. Sub-Gaussian random variables extend these exponential tail bounds to unbounded distributions — and most quantities in ML (bounded losses, Gaussian noise) are sub-Gaussian.
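A quick empirical check of Hoeffding on coin flips, which are bounded in [0, 1]: the observed tail frequency should sit at or below the bound, and it does, comfortably.
import numpy as np
np.random.seed(0)
n, t, n_trials = 100, 0.1, 20_000
flips = np.random.binomial(1, 0.5, size=(n_trials, n))
deviations = np.abs(flips.mean(axis=1) - 0.5)
empirical = np.mean(deviations >= t)
hoeffding = 2 * np.exp(-2 * n * t**2)   # (b - a) = 1 for coin flips
print(f"empirical P(|mean - 0.5| >= {t}) = {empirical:.4f}")
print(f"Hoeffding bound                  = {hoeffding:.4f}")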
Gaussian Mixture Models
A Gaussian Mixture Model assumes data is generated from K Gaussian components. The generative story: pick component k with probability πk (mixing weight), then sample x from N(μk, Σk). The density is P(x) = Σk πk · N(x|μk, Σk). You don't know which component generated each point — that's the latent variable.
GMMs are fit by the Expectation-Maximization (EM) algorithm. Two alternating steps:
E-step: For each data point, compute the posterior probability (responsibility) that it belongs to each component: rᵢk = πk·N(xᵢ|μk, Σk) / Σⱼ πⱼ·N(xᵢ|μⱼ, Σⱼ). This is Bayes' theorem — the prior is the mixing weight, the likelihood is the Gaussian density, the posterior is the responsibility.
M-step: Update each component's parameters using responsibilities as soft weights: μk = (Σᵢ rᵢk·xᵢ) / (Σᵢ rᵢk), and analogously for Σk and πk.
EM is guaranteed to increase (or maintain) log-likelihood each iteration — it never gets worse. But it can hit local optima, so run it multiple times with different initializations.
GMMs give you soft clustering: each point has a probability of membership in each cluster. K-means is actually a special case of GMM where all covariances are equal and isotropic, and you take hard assignments (argmax instead of soft responsibilities). When you need non-spherical clusters or uncertainty in assignments, GMMs are the tool.
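Here's a minimal EM sketch for a two-component 1-D GMM, written to mirror the E-step and M-step above. Real implementations (e.g., scikit-learn's GaussianMixture) add numerical safeguards and multiple restarts; this is the bare algorithm on toy data.
import numpy as np
np.random.seed(0)
# Toy data: two 1-D Gaussian clusters
data = np.concatenate([np.random.normal(-2, 0.8, 300),
                       np.random.normal(3, 1.2, 700)])
def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
pi = np.array([0.5, 0.5])      # mixing weights
mu = np.array([-1.0, 1.0])     # component means (deliberately wrong)
var = np.array([1.0, 1.0])     # component variances
for step in range(50):
    # E-step: responsibilities via Bayes' theorem
    dens = np.stack([pi[k] * gaussian_pdf(data, mu[k], var[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)
    # M-step: responsibility-weighted updates
    nk = resp.sum(axis=1)
    pi = nk / len(data)
    mu = (resp * data).sum(axis=1) / nk
    var = (resp * (data - mu[:, None])**2).sum(axis=1) / nk
print(f"weights={np.round(pi, 2)} means={np.round(mu, 2)} vars={np.round(var, 2)}")
# Expect roughly weights [0.3, 0.7] and means [-2, 3]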
Causal Inference
Correlation does not imply causation. We've all heard it a thousand times, and it still trips people up. Ice cream sales correlate with drowning deaths — hot weather is the confounder driving both. Our spam filter might learn that emails from a certain domain correlate with spam, but if we block that domain entirely, we might lose legitimate emails — because the correlation was driven by a third factor.
Simpson's paradox is the sharpest example: a trend in subgroups reverses when you combine them. A treatment can appear better overall but worse in every subgroup, because group sizes and baselines differ. It's confounding in action. The fix is conditioning on the confounder — not blindly aggregating.
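The classic kidney-stone numbers (Charig et al., 1986) make the reversal concrete: treatment A wins in both subgroups, treatment B wins overall, because B was given mostly the easy small-stone cases.
import numpy as np
# Successes / cases, split by stone size
treatment_a = {'small': (81, 87),   'large': (192, 263)}
treatment_b = {'small': (234, 270), 'large': (55, 80)}
for name, t in [('A', treatment_a), ('B', treatment_b)]:
    total_s = sum(s for s, n in t.values())
    total_n = sum(n for s, n in t.values())
    rates = {g: s / n for g, (s, n) in t.items()}
    print(f"Treatment {name}: small={rates['small']:.0%} "
          f"large={rates['large']:.0%} overall={total_s/total_n:.0%}")
# A wins in BOTH subgroups; B wins overall — condition on the confounder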
Directed Acyclic Graphs (DAGs) formalize causal relationships. Judea Pearl's do-calculus distinguishes observing from intervening: P(Y | do(X=x)) ≠ P(Y | X=x) because "doing" severs the confounder's influence on X. Models trained on observational data learn correlations, not causal effects. A model that discovers "carries a lighter → lung cancer" will make terrible interventions: confiscating lighters prevents nothing, because smoking drives both. As ML moves into healthcare, policy, and other high-stakes domains, causal reasoning is becoming essential, not optional.
Wrap-Up
We started with three emails and a confession about avoiding probability. We end with a toolkit that connects to nearly everything in modern ML.
Here's what we built. Probability gives us the language of uncertainty. Conditional probability and Bayes' theorem tell us how to update beliefs — and we saw how confusing P(A|B) with P(B|A) can send innocent people to prison or make your spam filter worthless. Distributions are assumptions your models make whether you realize it or not — Bernoulli gives you cross-entropy, Gaussian gives you MSE, Beta gives you a conjugate prior. MLE derives your loss functions. MAP derives your regularization. The CLT explains why SGD noise is Gaussian and what batch size does to it. Bayesian inference gives you uncertainty quantification, and variational inference powers VAEs. Hypothesis testing keeps you honest about whether your improvements are real.
I'm grateful you stuck with this. Probability was the math I resisted the longest, and it turned out to be the math I use the most. If any section felt dense, come back to it after you've seen the concept appear in a later chapter — it'll click differently when you have a concrete use case.
The spam filter we threaded through this section was a toy. But the principles behind it — updating beliefs with evidence, choosing the right distributional assumption, testing whether differences are real — are the same principles you'll use when building production systems at scale. The math doesn't change. Only the number of parameters does.
Next up, we tackle Optimization — the machinery that actually finds the parameters once you've defined what "good" means probabilistically. Everything we derived here (loss functions, regularization, the ELBO) becomes an objective that optimization minimizes. The two sections are designed to be read together.