Bayesian Fundamentals

Chapter 14: Probabilistic & Bayesian ML
Bayes’ Theorem · Priors · Posteriors · Bayesian Updating

I avoided Bayesian statistics for longer than I’d like to admit. Every time someone said “prior distribution” or “posterior belief,” I’d nod, change the subject, and go back to fitting models with maximum likelihood because at least that felt concrete. A single number. An answer. But the discomfort of not understanding what people meant when they said “uncertainty-aware” kept growing. So I finally sat down with Bayes’ theorem — the actual mechanics, not the Wikipedia summary — and built everything up from scratch. Here is that dive.

Bayesian inference is a framework for updating beliefs in light of new evidence. It was formalized by Thomas Bayes in an essay published posthumously in 1763, and developed independently by Pierre-Simon Laplace. Today it underpins everything from spam filters and A/B testing to Gaussian processes and probabilistic programming. At its core, it’s a formula for answering the question: “Given what I’ve seen, what should I believe now?”

Before we start, a heads-up. We’re going to be working with probability distributions, some light algebra, and a running example involving website buttons. You don’t need to know any of it beforehand. We’ll add what we need, one piece at a time.

This isn’t a short journey, but I hope you’ll be glad you came.

Two camps of probability
Flipping the question
The four characters
Our first Bayesian update
Conjugate priors — the algebraic shortcut
The beta-binomial, step by step
Sequential updating
Rest stop
MAP vs. MLE — the regularization secret
Credible intervals vs. confidence intervals
Choosing priors wisely
Bayesian decision theory
Wrap-up
Resources

Two Camps of Probability

Before we can understand Bayesian inference, we need to confront a question that has divided statisticians for over a century: what does “probability” even mean?

Imagine you’re holding a coin. One camp — the frequentists — says the probability of heads is 0.5 because if you flipped this coin ten thousand times, roughly half would come up heads. Probability is a property of the coin and the flipping process. It’s a long-run frequency. The coin doesn’t care what you think about it.

The other camp — the Bayesians — says probability is your degree of belief. Before you flip the coin, you believe heads and tails are equally likely. After you flip it ten times and get eight heads, your belief shifts. The coin hasn’t changed. Your knowledge about it has. Probability is in your head, not in the coin.

I’ll be honest — the first time I read that description, it felt like philosophy, not math. “Degree of belief” sounded mushy. But as we’ll see, that mushiness turns out to be extraordinarily useful. It lets you do something frequentist statistics can’t: start with imperfect knowledge, incorporate evidence as it arrives, and maintain a calibrated picture of what you know and what you don’t know.

That’s the Bayesian promise. Now we need the machinery to deliver on it.

Flipping the Question

Here’s a tiny scenario. Suppose you run a website, and you’re testing a new green button versus your current blue button. You want to know: if someone clicked the button, was it more likely green or blue?

You know a few things. Out of all visitors, 30% see the green button. The green button gets clicked by 10% of people who see it. The blue button gets clicked by 4% of people who see it. Someone clicked. Which button were they probably looking at?

Your gut says green — it has the higher click rate. But 70% of visitors see blue. There are way more blue viewers. So who actually generates more clickers?

Let’s count. Imagine 1000 visitors. 300 see green, and 10% of them click: 30 clickers. 700 see blue, and 4% of them click: 28 clickers. Total clickers: 58. Of those, 30 came from green. So given a click, the probability it came from green is 30/58, about 52%. Not the slam dunk your gut predicted.

We didn’t use any formula there. We counted. But what we actually computed has a name: Bayes’ theorem. We took a question we couldn’t answer directly (“given a click, which button?”) and decomposed it into pieces we could measure (“given each button, how often do people click?” and “how often does each button appear?”). The counting approach we used is the theorem. The formula is the counting, compressed.
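The whole calculation fits in a few lines. Here’s a sketch using the rates from the scenario, first counting visitors, then applying the formula directly to show they agree:

```python
# P(green | click): first by counting, then by Bayes' theorem directly
p_green, p_blue = 0.30, 0.70               # who sees which button
p_click_green, p_click_blue = 0.10, 0.04   # click rate for each button

# Counting with 1000 imaginary visitors
green_clickers = 1000 * p_green * p_click_green   # 30 clickers from green
blue_clickers = 1000 * p_blue * p_click_blue      # 28 clickers from blue
print(green_clickers / (green_clickers + blue_clickers))  # ~0.517

# The same number via the formula:
# P(green | click) = P(click | green) * P(green) / P(click)
evidence = p_click_green * p_green + p_click_blue * p_blue
print(p_click_green * p_green / evidence)  # ~0.517
```

Both prints land on 30/58. The formula really is the counting, compressed.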

The Four Characters

Bayes’ theorem has four moving parts. Each one has a specific job, and understanding those jobs is the whole game.

P(θ | D) = P(D | θ) × P(θ) / P(D)

posterior = likelihood × prior / evidence

Let’s meet them one at a time using our button example. Suppose θ represents “the click-through rate of the green button” and D is “the click data we observed.”

The prior, P(θ), is what you believed about the click-through rate before you ran the experiment. Maybe you’ve tested green buttons before and they tend to convert between 5% and 15%. That prior knowledge goes here. Think of it like your starting position on a map — maybe you’re not sure exactly where you are, but you know you’re roughly in this neighborhood.

The likelihood, P(D|θ), is how well each possible click-through rate explains the data you actually saw. If θ were 10%, how probable is it that you’d see exactly the clicks you saw? What if θ were 2%? The likelihood scores every candidate value of θ by how well it predicts your observations. This is the same likelihood that frequentists maximize when they do MLE (Maximum Likelihood Estimation). The Bayesians and the frequentists agree on this piece completely.

The evidence, P(D), is the probability of seeing this data regardless of what θ is. It averages the likelihood across all possible θ values, weighted by the prior. It’s a normalizing constant — it makes sure the posterior adds up to one. I’ll be honest: this term confused me for months. The key insight is that for most practical problems, you don’t need to compute it. The posterior is proportional to likelihood times prior, and that’s often enough.

The posterior, P(θ|D), is the payoff. It’s your updated belief after seeing the data. It combines your prior knowledge with the evidence from the experiment. If the map analogy holds, the prior was your general sense of location, the likelihood was the GPS signal, and the posterior is your refined position — sharper than either one alone.

That’s it. Those are the four characters. Everything else in this section is about making them practical.
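One way to watch all four characters interact is a brute-force grid: score every candidate θ by likelihood times prior, then normalize. This sketch assumes a flat prior and hypothetical data of 3 clicks in 10 visits (numbers of my choosing, just for illustration):

```python
import numpy as np
from scipy import stats

theta = np.linspace(0.001, 0.999, 999)      # candidate click-through rates
prior = np.ones_like(theta)                 # flat prior: every rate equally plausible
likelihood = stats.binom.pmf(3, 10, theta)  # P(3 clicks in 10 visits | theta)

unnormalized = likelihood * prior
evidence = unnormalized.sum()               # the normalizing constant, approximated
posterior = unnormalized / evidence         # now sums to one

print(theta[np.argmax(posterior)])          # ~0.3: with a flat prior, the peak is the MLE
```

Notice that the evidence played no role in locating the peak; it only rescaled the curve. That’s why you can so often get away without computing it.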

Our First Bayesian Update

Let’s get concrete. Back to our website. You’ve launched a new green “Sign Up” button and you want to estimate its true click-through rate θ. You have no historical data for this exact button, but from past experiments, you believe click-through rates for signup buttons tend to hover around 5% to 15%. So your prior belief about θ is roughly centered at 10%, with some spread.

Now 20 visitors see the green button. 4 of them click. That’s a 20% observed click rate. Should you believe the true rate is 20%?

Probably not. Twenty visitors is a tiny sample. Your prior knowledge says rates like 20% are unusual for signup buttons. The Bayesian update will pull the estimate somewhere between your prior (centered around 10%) and the raw data (20%), weighted by how much data you have.

With only 20 visitors, the prior has considerable pull. The posterior might land around 12–14%. If you’d seen 4 clicks in 20 visits and another 15 clicks in the next 100 visits (19 clicks total in 120 visits, about 16%), the data would start to dominate and the posterior would shift toward the observed rate.

This is the fundamental Bayesian dynamic: the posterior is a compromise between prior and data. With little data, the prior matters a lot. With lots of data, the data overwhelms the prior. Given enough evidence, Bayesians and frequentists converge on the same answer. The prior is like a rubber band attached to your starting belief — it pulls hard when you don’t have much data, but stretches and lets go as evidence accumulates.

That rubber band analogy will come back. Keep it in mind.

Conjugate Priors — The Algebraic Shortcut

There’s a problem with what we’ve described so far. Computing the posterior requires that integral in the denominator — the evidence P(D). For most real problems, that integral has no closed-form solution. You’d need numerical methods, sampling, or approximations. For a conceptual understanding, that’s fine. For doing actual math on paper or updating beliefs quickly in production, it’s a headache.

But there’s a beautiful escape hatch. For certain combinations of prior and likelihood, the posterior has the same mathematical form as the prior, with updated parameters. You start with a distribution from a family, see data, and end up with another distribution from the same family — with different numbers plugged in. No integrals needed. The algebra works out.

A prior that has this property is called a conjugate prior for that likelihood. “Conjugate” means “paired with” — the prior and likelihood are a matched set that produce a clean posterior.

The most important conjugate pairs look like this. If your data comes from coin flips (Bernoulli or Binomial likelihood), the conjugate prior is the Beta distribution. If your data is normally distributed with known variance, the conjugate prior is another Normal. If you’re counting events (Poisson likelihood), the conjugate prior is the Gamma distribution. And if you’re modeling proportions across categories (Multinomial likelihood), the conjugate prior is the Dirichlet.

In each case, the update rule is the same pattern: take the prior’s parameters, add something from the data, and you’ve got the posterior’s parameters. No sampling. No optimization. Arithmetic.

The limitation is real, though. Conjugate priors exist only for certain likelihood-prior pairs. If your model is a neural network, or your likelihood is something exotic, there’s no conjugate shortcut. You’ll need the heavier tools from later sections — MCMC, variational inference, and friends. But for a surprising number of practical problems, conjugacy is all you need.

The Beta-Binomial, Step by Step

The Beta-Binomial is the workhorse of Bayesian inference. It’s the conjugate pair you’ll use most often in practice, and it’s the easiest to understand completely. Let’s build it from nothing.

Our scenario: you’re still testing that green signup button. You want to estimate θ, the true probability that a visitor clicks it. The data is binary — each visitor either clicks (success) or doesn’t (failure). That makes the likelihood Binomial.

The Beta distribution lives on the interval [0, 1], which is exactly the range of a probability. It has two parameters, α and β. Here’s the intuition: α counts “imaginary successes you’ve already seen,” and β counts “imaginary failures you’ve already seen.” A Beta(1, 1) is completely flat — every value of θ from 0 to 1 is equally likely. That’s the “I have no idea” prior. A Beta(10, 10) is peaked at 0.5 — you’re fairly confident the rate is near 50%, as if you’ve already seen 10 heads and 10 tails. A Beta(2, 8) has mean 0.2, with its peak a bit below that — you believe the rate is low.
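You can check these intuitions against scipy — the means follow the α/(α+β) rule, and the spread shrinks as the pseudocounts grow:

```python
from scipy import stats

for a, b, label in [(1, 1, "flat / 'no idea'"),
                    (10, 10, "confident near 0.5"),
                    (2, 8, "rate is low")]:
    mean = stats.beta.mean(a, b)   # always a / (a + b)
    sd = stats.beta.std(a, b)
    print(f"Beta({a}, {b}): mean = {mean:.3f}, sd = {sd:.3f}  ({label})")
```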

Now for the magic. Suppose your prior is Beta(α₀, β₀) and you observe k successes in n trials. The posterior is:

Posterior = Beta(α₀ + k, β₀ + n - k)

That’s it. Add successes to α, add failures to β. Let’s trace through it with real numbers.

We start with Beta(2, 2). That’s a gentle prior saying “I think the rate is somewhere around 50%, but I’m not very sure.” The mean of this prior is α/(α+β) = 2/4 = 0.5. Think of it as having seen 2 imaginary clicks and 2 imaginary non-clicks before the experiment even starts.

Day one: 10 visitors, 3 click. We update. α becomes 2 + 3 = 5. β becomes 2 + 7 = 9. Our posterior is Beta(5, 9), with mean 5/14 ≈ 0.357. The raw data said 30% clicked, our prior said 50%, and the posterior landed at about 36% — a compromise that leans toward the data because we saw 10 real observations against our 4 imaginary ones.

Day two: 20 more visitors, 6 click. Now we don’t go back to the original prior. We use yesterday’s posterior as today’s prior. α becomes 5 + 6 = 11. β becomes 9 + 14 = 23. Posterior: Beta(11, 23), mean ≈ 0.324. With 30 real data points, the prior’s influence is fading. The rubber band is stretching.

Day three: 50 visitors, 14 click. α = 11 + 14 = 25. β = 23 + 36 = 59. Posterior: Beta(25, 59), mean ≈ 0.298. After 80 observations, our estimate has settled near 30%. The original prior centered at 50% is barely detectable. The data did its job.

One thing worth noticing: the posterior mean, α/(α+β), is a weighted average of the prior mean and the data proportion. The weights are the effective sample sizes — the prior’s pseudocounts versus the real observation count. As real data accumulates, the prior’s pseudocounts become negligible. This is why Bayesian and frequentist answers converge with enough data. It’s not a coincidence. It’s arithmetic.
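That weighted-average claim is easy to verify with the day-one numbers:

```python
prior_a, prior_b = 2, 2   # prior pseudocounts: 2 imaginary clicks, 2 non-clicks
k, n = 3, 10              # day one: 3 clicks in 10 visits

prior_mean = prior_a / (prior_a + prior_b)   # 0.5
data_mean = k / n                            # 0.3
prior_n = prior_a + prior_b                  # effective prior sample size: 4

posterior_mean = (prior_a + k) / (prior_a + prior_b + n)            # 5/14
weighted = (prior_n * prior_mean + n * data_mean) / (prior_n + n)   # same thing

print(posterior_mean, weighted)   # both ~0.357
```

Four pseudocounts against ten real observations: the data already gets over twice the weight, and its share only grows from here.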

Sequential Updating

We already snuck in the most elegant property of Bayesian inference in the previous section: today’s posterior becomes tomorrow’s prior. But it’s worth pausing on why this matters.

In frequentist statistics, if you want to incorporate new data, you typically need to re-analyze everything from scratch. If you collected data in three batches, you combine all the raw data and run the test once. Worse, if you peeked at the results after each batch — checked whether the p-value was significant yet — you’ve committed a statistical sin called optional stopping, a close cousin of multiple testing. Your false positive rate inflates, and your results are no longer trustworthy at the stated significance level.

Bayesian updating has no such problem. The posterior is always valid, regardless of when or how often you look at it. You can check after 10 visitors, again after 50, again after 200. Each time, the posterior reflects all the evidence seen so far. No corrections needed. No guilt.

from scipy import stats

alpha, beta_param = 2, 2    # prior: Beta(2, 2)

batches = [
    (3, 7),     # day 1: 3 clicks, 7 no-clicks
    (6, 14),    # day 2: 6 clicks, 14 no-clicks
    (14, 36),   # day 3: 14 clicks, 36 no-clicks
]

print(f"Start:       Beta({alpha}, {beta_param}), "
      f"mean = {alpha/(alpha+beta_param):.3f}")

for i, (hits, misses) in enumerate(batches, 1):
    alpha += hits
    beta_param += misses
    lo = stats.beta.ppf(0.025, alpha, beta_param)
    hi = stats.beta.ppf(0.975, alpha, beta_param)
    print(f"After day {i}: Beta({alpha}, {beta_param}), "
          f"mean = {alpha/(alpha+beta_param):.3f}, "
          f"95% interval = [{lo:.3f}, {hi:.3f}]")

Run this and you’ll see the interval shrink with each batch. The estimate stabilizes. And here’s the punchline: you get the exact same posterior whether you process all 80 observations at once or in three batches. The math doesn’t care about your schedule. This is why Bayesian methods are natural for A/B testing, online learning, and any setting where data arrives in streams rather than all at once.
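That order-independence claim is worth verifying once. Folding all three days into a single update recovers the same Beta(25, 59) the sequential loop ends on:

```python
# All 80 observations in one update: 23 clicks, 57 no-clicks
alpha0, beta0 = 2, 2
total_hits = 3 + 6 + 14
total_misses = 7 + 14 + 36

alpha_batch = alpha0 + total_hits    # 25
beta_batch = beta0 + total_misses    # 59
print(f"Beta({alpha_batch}, {beta_batch})")  # prints Beta(25, 59)
```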

Rest Stop

Congratulations on making it this far. If you need to stop here, you can.

You now have a working mental model of Bayesian inference: start with a prior belief, observe data, update to a posterior. You know that conjugate priors give you closed-form updates, that the Beta-Binomial is the most common one, and that sequential updating lets you incorporate evidence as it arrives without starting over. That’s a genuinely useful toolkit. You can run Bayesian A/B tests, estimate proportions with uncertainty, and explain to a colleague why their confidence interval doesn’t mean what they think it means.

But there’s more machinery under the hood. We haven’t talked about how Bayesian inference connects to regularization (every L2 penalty is secretly a Gaussian prior). We haven’t confronted the difference between credible intervals and confidence intervals, which trips up even experienced practitioners. And we haven’t discussed how to choose priors wisely, or how to make optimal decisions when the stakes are asymmetric.

The short version: MAP estimation is regularized MLE. Credible intervals answer the question you actually want to ask. Priors should rule out the absurd without trying to be clever. There. You’re 70% of the way there.

But if the discomfort of not knowing what’s underneath is nagging at you, read on.

MAP vs. MLE — The Regularization Secret

We’ve been working with full posterior distributions — entire curves describing our uncertainty. That’s the ideal. But sometimes you need a single number. A point estimate. A best guess.

There are two obvious ways to extract a number from the posterior. Maximum Likelihood Estimation (MLE) ignores the prior entirely and asks: what parameter value makes the observed data most probable? It maximizes P(D|θ). Maximum A Posteriori (MAP) asks a slightly different question: what parameter value is most probable given both the data and the prior? It maximizes P(θ|D), which is proportional to P(D|θ) × P(θ).

The difference is the prior term. And this is where things get interesting.

Take the log of the MAP objective. You get: log P(D|θ) + log P(θ). The first term is the log-likelihood — the same thing MLE maximizes. The second term is the log-prior. Now, if your prior is a Gaussian centered at zero — P(θ) ∝ exp(-θ²/2σ²) — then log P(θ) = -θ²/2σ² plus a constant. That’s an L2 penalty on θ. MAP with a Gaussian prior is exactly Ridge regression. The regularization strength λ is proportional to 1/σ² (the exact constant depends on how you scale the loss and the penalty): the tighter the prior, the stronger the regularization.

If your prior is a Laplace distribution instead — peaked at zero with heavier tails — then log P(θ) = -|θ|/b plus a constant. That’s an L1 penalty. MAP with a Laplace prior is exactly Lasso regression.

Every time you’ve added a regularization penalty to a loss function, you’ve been doing Bayesian MAP estimation with a specific prior distribution. You were being Bayesian all along. You didn’t know it. I find that deeply satisfying.
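Here’s a quick numerical check of the Gaussian-prior case. The setup is a toy model of my own (estimating a single mean θ from noisy observations, not the chapter’s button example): a grid search over the L2-penalized loss lands on the same answer as the closed-form MAP estimate for this model.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(1.5, 1.0, size=20)   # toy data: true mean 1.5, unit noise
sigma_prior = 1.0                   # prior: theta ~ Normal(0, sigma_prior^2)

# Penalized loss: negative log-likelihood + L2 penalty from the Gaussian prior
theta = np.linspace(-1, 3, 40001)
loss = 0.5 * ((y[:, None] - theta) ** 2).sum(axis=0) + theta**2 / (2 * sigma_prior**2)
theta_grid = theta[np.argmin(loss)]

# Closed-form MAP for this model: the sample sum, shrunk toward zero
theta_map = y.sum() / (len(y) + 1 / sigma_prior**2)

print(theta_grid, theta_map)   # agree to grid resolution
```

Note the shrinkage in the denominator: the prior acts like one extra phantom observation pinned at zero.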

I’m still building my intuition for when full Bayesian inference — integrating over the entire posterior rather than taking the mode — is worth the computational cost over MAP. The honest answer: it matters most when data is scarce and uncertainty is the thing you actually care about. With large datasets, MAP and full Bayes tend to agree. With small datasets, the full posterior gives you calibrated uncertainty that MAP cannot. The posterior mean also tends to be a better point estimate than the posterior mode, especially for skewed distributions.

Credible Intervals vs. Confidence Intervals

This is the thing that trips up everyone, and I mean everyone. I still have to pause and think about it carefully when explaining it to colleagues.

A 95% credible interval says: “Given the data I’ve observed, there is a 95% probability that the true parameter lies in this interval.” That’s the intuitive statement. That’s what you want to say when you report results.

A 95% confidence interval says something maddeningly different: “If I repeated this entire experiment infinitely many times, and each time I constructed an interval using this procedure, 95% of those intervals would contain the true parameter.” It says absolutely nothing about whether this particular interval contains the true value. The parameter is either in there or it isn’t — it’s a fixed number, and fixed numbers don’t have probabilities in the frequentist worldview.

Here’s a way to feel the difference. Suppose you computed a 95% confidence interval and got [0.23, 0.41]. Can you say “there’s a 95% chance the true value is between 0.23 and 0.41”? Strictly speaking, no. Not as a frequentist. The true value is fixed. Your interval is the random thing. In a Bayesian framework, you absolutely can make that statement, because you treat the parameter as a random variable with a distribution.

The irony is that nearly everyone interprets confidence intervals as if they were credible intervals. When your colleague says “there’s a 95% chance the true effect is in this range,” they’re thinking Bayesian — whether they realize it or not.

For large samples with vague priors, the two intervals are nearly identical numerically. The philosophical difference surfaces most with small data or strong priors, which is exactly when the Bayesian approach earns its keep.
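You can watch the frequentist guarantee play out by simulation. Under an assumed true rate (my choice, for illustration), roughly 95% of the intervals the procedure constructs cover it — but no single interval carries a probability statement:

```python
import numpy as np

rng = np.random.default_rng(42)
true_p, n = 0.3, 1000   # assumed true click rate and sample size
z = 1.96                # 95% normal quantile

covered = 0
trials = 20_000
for _ in range(trials):
    k = rng.binomial(n, true_p)
    p_hat = k / n
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)   # Wald interval half-width
    covered += (p_hat - half <= true_p <= p_hat + half)

print(covered / trials)   # ~0.95: the guarantee is about the procedure, not one interval
```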

Choosing Priors Wisely

The prior is the part that makes people nervous. “You’re putting your thumb on the scale,” they say. And they’re not wrong. But the alternative — pretending you have no prior knowledge — is also putting your thumb on the scale. A flat prior over all real numbers isn’t “no assumption.” It’s the assumption that a parameter value of 10 billion is as plausible as a value of 0.5. That’s a very strong statement, and usually an absurd one.

The practical wisdom, largely shaped by Andrew Gelman and the Stan community, comes down to three tiers.

Weakly informative priors are the sweet spot for most problems. They rule out absurd values without committing to specific ones. For regression coefficients on standardized data, Normal(0, 2.5) is a good default. For standard deviations, a Half-Normal or Half-Cauchy centered at zero constrains the parameter to be positive and moderate without being dogmatic. The guiding question is: “Would I be shocked if the true value were outside this range?” If yes, your prior should make such values unlikely. If no, widen it.

Informative priors encode genuine domain knowledge. If previous experiments consistently show click-through rates between 2% and 15%, encode that. If physics constrains a coefficient to be positive, say so. Informative priors are powerful when the knowledge is real and dangerous when it’s wishful thinking.

Non-informative priors (flat, Jeffreys’) try to “let the data speak.” Jeffreys’ prior has an elegant property — it’s invariant under reparameterization, so switching from measuring in meters to feet doesn’t change the inference. For a Bernoulli parameter, Jeffreys’ prior is Beta(0.5, 0.5). But truly non-informative priors can behave pathologically in high dimensions or with complex models. They’re a starting point, not a destination.

No one agrees on the “right” prior for a given problem, and that’s actually fine. The standard practice is to run a prior predictive check: simulate data from your prior and see if the predictions look reasonable. If your prior implies that a click-through rate of 99% is plausible, something is off. Adjust and repeat. It’s an iterative process, not a one-shot decision.
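A prior predictive check for the button problem takes only a few lines: sample rates from the prior, simulate a day’s click counts from each, and eyeball whether the results look like signup-button data. The Beta(2, 2) prior here is the chapter’s; the visitor count is illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n_visitors = 100

# Draw click-through rates from the prior, then simulate a day's data from each
theta = rng.beta(2, 2, size=10_000)
clicks = rng.binomial(n_visitors, theta)

# What does this prior think a typical day looks like?
print(np.percentile(clicks, [2.5, 50, 97.5]))
print(np.mean(theta > 0.9))   # how plausible does the prior find a 90%+ click rate?
```

If the simulated days look absurd for your domain — here, Beta(2, 2) still puts a few percent of its mass above a 90% click rate — tighten the prior and run the check again.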

Bayesian Decision Theory

Everything we’ve built so far gives us a posterior — a distribution over what we believe. But at some point, you have to do something. Ship the green button or keep the blue one. Treat the patient or wait. Approve the loan or decline it. Bayesian decision theory is the bridge from belief to action.

The core idea is that different mistakes cost different amounts. Declaring a button “better” when it’s actually worse is annoying — you lose some conversions until you notice. Telling a patient they don’t have cancer when they do is catastrophic. The optimal action depends not only on what you believe (the posterior) but on what’s at stake (the loss function).

Back to our button example. Suppose the green button’s posterior mean conversion rate is 12%, and the blue button’s is 10%. Green looks better. But what if switching buttons costs $5,000 in engineering time, and the revenue difference is negligible unless the true lift is at least 3 percentage points? The posterior tells you the probability that the true lift exceeds 3 points. If that probability is low, the optimal decision might be to keep the blue button despite the green one’s higher point estimate.

Formally, given a set of possible actions a, a loss function L(θ, a) that quantifies the cost of taking action a when the true state is θ, and a posterior P(θ|D), the Bayes-optimal action minimizes the expected posterior loss:

a* = argmin_a  E[L(θ, a)]
   = argmin_a  ∫ L(θ, a) × P(θ|D) dθ

When the loss is symmetric (you care equally about over- and under-estimating), the optimal point estimate is the posterior mean. When the loss is absolute (you care about magnitude but not direction), it’s the posterior median. When you want the single most probable value, it’s the posterior mode (MAP). Different loss functions, different optimal actions — all from the same posterior.

import numpy as np

# A/B test decision with asymmetric costs
# Green button posterior: Beta(25, 59), mean ≈ 0.298
# Blue button posterior: Beta(40, 360), mean ≈ 0.100

alpha_green, beta_green = 25, 59
alpha_blue, beta_blue = 40, 360

samples_green = np.random.beta(alpha_green, beta_green, 100_000)
samples_blue = np.random.beta(alpha_blue, beta_blue, 100_000)

# What fraction of the time does green beat blue?
p_green_wins = np.mean(samples_green > samples_blue)

# What's the probability the lift exceeds 3 percentage points?
lift = samples_green - samples_blue
p_meaningful_lift = np.mean(lift > 0.03)

print(f"P(green > blue) = {p_green_wins:.3f}")
print(f"P(lift > 3pp) = {p_meaningful_lift:.3f}")
print(f"Expected lift = {np.mean(lift):.3f}")
print(f"95% credible interval for lift: "
      f"[{np.percentile(lift, 2.5):.3f}, "
      f"{np.percentile(lift, 97.5):.3f}]")

This is the kind of question a business actually needs answered. Not “is the difference statistically significant?” but “what’s the probability the difference is big enough to be worth acting on?” That’s a decision question, and Bayesian inference answers it directly.
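And the earlier claim — different losses pick different point estimates from the same posterior — can be checked by brute force against the green button’s Beta(25, 59) posterior:

```python
import numpy as np

samples = np.random.default_rng(1).beta(25, 59, 200_000)
candidates = np.linspace(0.15, 0.45, 301)

# Score every candidate point estimate under two different losses
sq_loss = [np.mean((samples - a) ** 2) for a in candidates]
abs_loss = [np.mean(np.abs(samples - a)) for a in candidates]

best_sq = candidates[np.argmin(sq_loss)]
best_abs = candidates[np.argmin(abs_loss)]

print(best_sq, samples.mean())       # squared loss -> posterior mean
print(best_abs, np.median(samples))  # absolute loss -> posterior median
print((25 - 1) / (25 + 59 - 2))      # mode (MAP) = (alpha-1)/(alpha+beta-2), ~0.293
```

Three defensible point estimates, three slightly different numbers, one posterior. The loss function breaks the tie.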

Wrap-Up

If you’re still with me, thank you. I hope it was worth it.

We started with a philosophical question — what does probability mean? — and built up a complete machinery for updating beliefs with evidence. We traced through Bayes’ theorem by counting, met the four characters (prior, likelihood, evidence, posterior), worked through the Beta-Binomial step by step with a running website example, discovered that sequential updating lets you peek at results without guilt, found that regularization is secretly Bayesian MAP estimation, confronted the credible-vs-confidence interval confusion, discussed how to choose priors that are honest without being dogmatic, and connected posterior beliefs to real-world decisions through loss functions.

My hope is that the next time you see a posterior distribution or hear someone mention “putting a prior on it,” instead of nodding and changing the subject, you’ll picture that rubber band between prior and data, and have a pretty good mental model of what’s going on under the hood.

Resources

These are the resources I found most helpful while building my own understanding: