Sequential Models & Probabilistic Programming Languages
I avoided sequential probabilistic models for an embarrassingly long time. Every time a colleague mentioned “forward-backward” or “Kalman gain,” I nodded along, privately hoping nobody would ask me a follow-up question. The math looked dense, the notation was intimidating, and there seemed to be a dozen variants of everything. Eventually the discomfort of not knowing what was really happening under the hood grew too great to ignore. This section is that dive.
Sequential models — Hidden Markov Models, Kalman filters, particle filters — are the classic machinery for tracking a hidden state that evolves over time from noisy observations. They powered speech recognition, GPS navigation, and robot localization long before deep learning existed. Probabilistic programming languages (PPLs) are the modern layer on top: tools that let you specify any generative model as code and have the computer figure out the inference for you. Together, these ideas form the backbone of Bayesian reasoning about things that change.
Before we start, a heads-up. We’re going to be talking about probability distributions, matrix operations, and Bayesian inference, but you don’t need to know any of it beforehand. We’ll add what we need, one piece at a time.
This isn’t a short journey, but I hope you’ll be glad you came.
A tiny weather station
Hidden Markov Models — the discrete case
The forward algorithm
The backward algorithm and smoothing
The Viterbi algorithm
Learning the model — Baum-Welch
Rest stop
Kalman filters — the continuous case
The predict-update cycle
When linearity breaks — EKF and UKF
Particle filters — throwing darts at the posterior
Probabilistic programming languages
Writing a model in PyMC
Stan, NumPyro, and Pyro
Posterior predictive checks
Model comparison — WAIC and LOO-CV
Bayesian neural networks
Uncertainty quantification
Resources and credits
The hidden state problem
Here is a situation that comes up constantly in the real world: there’s something you care about, but you can’t observe it directly. You can only see noisy, imperfect reflections of it.
A doctor can’t see the disease directly — only symptoms. A self-driving car can’t see its exact position — only GPS readings that jitter by a few meters. A stock trader can’t see the “regime” the market is in — only price movements that could mean anything.
The thing you care about is the hidden state. The thing you can measure is the observation. And the question is always the same: given the observations, what can we say about the hidden state?
The answer, it turns out, is a loop. Predict what you think the state will be next. Observe what actually happens. Correct your prediction. Repeat. That loop — predict, observe, correct — is the heartbeat of every sequential model we’ll build in this section. It’s Bayes’ rule applied over and over, one time step at a time.
A tiny weather station
Let’s make this concrete with a toy example we’ll keep coming back to throughout the entire section.
Imagine you run a small ice cream shop in a town where the weather forecast is unreliable. You know the weather affects your sales: sunny days mean more cones, rainy days mean fewer. But you don’t have a window — you work in the back kitchen. All you see is how many cones you sold each day. From that sales number, you want to infer what the weather was.
We’ll start absurdly small. The weather has two possible states: Sunny and Rainy. Each day, you observe one of three sales levels: Low (0–10 cones), Medium (11–25 cones), or High (26+ cones). That’s it. Two hidden states, three possible observations.
We also know a few things about how weather works in this town. If today is sunny, there’s a 70% chance tomorrow will be sunny too, and a 30% chance it’ll rain. If today is rainy, there’s a 60% chance tomorrow will still be rainy, and a 40% chance the sun comes out. These are transition probabilities — the odds of moving from one hidden state to another.
And we know how weather affects sales. On sunny days, there’s a 60% chance of high sales, 30% medium, and 10% low. On rainy days, it’s flipped: 10% high, 30% medium, 60% low. These are emission probabilities — the odds of seeing a particular observation given the hidden state.
This tiny setup — two states, three observations, a transition matrix, and an emission matrix — is all we need to build our first sequential model. Keep it in mind. We’ll use it to walk through every algorithm that follows.
Hidden Markov Models — the discrete case
Our ice cream shop setup has a formal name: a Hidden Markov Model, or HMM. The “Markov” part means the hidden state tomorrow depends only on the hidden state today — not on last week, not on last month. The weather tomorrow is a function of today’s weather alone. That’s the Markov property: the future is independent of the past, given the present. The “hidden” part means we never see the state directly.
An HMM has three ingredients. First, a set of hidden states — in our case, Sunny and Rainy. Second, a transition matrix A, where A_ij is the probability of moving from state i to state j. Third, an emission matrix B, where B_ij is the probability of observing output j when in state i. There’s also an initial distribution π that tells us what state we think the system starts in.
For our weather station:
Transition matrix A:             Emission matrix B:
        Sunny  Rainy                     Low   Med   High
Sunny [  0.7    0.3  ]           Sunny [ 0.1   0.3   0.6 ]
Rainy [  0.4    0.6  ]           Rainy [ 0.6   0.3   0.1 ]
Initial distribution π: Sunny = 0.5, Rainy = 0.5
Now suppose we observe three days of sales: High, Medium, Low. What was the weather? That’s the question HMMs are built to answer, and there are three distinct ways to answer it.
The forward algorithm
The first question we might ask is: given our model, how likely is this particular observation sequence? How probable is it that we’d see High, Medium, Low from this specific weather model? This is the evaluation problem, and it’s solved by the forward algorithm.
The intuition is that we sweep through time left to right, accumulating evidence. At each time step, we ask: what’s the probability of being in each hidden state, given everything we’ve observed so far?
Let’s trace through it by hand with our three-day sequence: High, Medium, Low.
Day 1: We observe High. We start with our initial distribution (50/50) and multiply by the emission probability of High from each state. Sunny gives 0.5 × 0.6 = 0.30. Rainy gives 0.5 × 0.1 = 0.05. So after day 1, Sunny is looking much more likely. These values — 0.30 for Sunny, 0.05 for Rainy — are our forward probabilities α1.
Day 2: We observe Medium. For each state today, we consider every state from yesterday, multiply the forward probability by the transition probability, sum them up, and multiply by today’s emission probability. For Sunny today: (0.30 × 0.7 + 0.05 × 0.4) × 0.3 = 0.069. For Rainy today: (0.30 × 0.3 + 0.05 × 0.6) × 0.3 = 0.036. Sunny is still ahead, but the gap is narrowing.
Day 3: We observe Low. Same procedure. Sunny today: (0.069 × 0.7 + 0.036 × 0.4) × 0.1 = 0.006. Rainy today: (0.069 × 0.3 + 0.036 × 0.6) × 0.6 = 0.025. Now Rainy is dominant. The low sales shifted the balance.
The total probability of the observation sequence is the sum of the final forward probabilities: 0.006 + 0.025 = 0.031. That’s how likely it is to see High, Medium, Low from our weather model.
The beauty of the forward algorithm is that it runs in O(T × K²) time, where T is the sequence length and K is the number of hidden states. Without it, we’d have to enumerate every possible state sequence — K^T of them — which blows up fast.
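The hand trace above fits in a few lines of NumPy, using the transition and emission matrices from our weather model. (The exact total comes out to ≈0.0317; the 0.031 in the trace is the sum of the already-rounded day-3 values.)

```python
import numpy as np

# Weather model: rows are Sunny, Rainy
A = np.array([[0.7, 0.3],            # transition probabilities
              [0.4, 0.6]])
B = np.array([[0.1, 0.3, 0.6],       # emission probabilities: Low, Med, High
              [0.6, 0.3, 0.1]])
pi = np.array([0.5, 0.5])            # initial distribution

obs = [2, 1, 0]                      # High, Medium, Low

# Forward pass: accumulate evidence left to right
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]

print(alpha.round(3))                # final forward probabilities per state
print(alpha.sum())                   # probability of the whole sequence
```

The loop body is the day-2 and day-3 computation from the trace: multiply yesterday’s forward probabilities by the transitions, sum (the `@`), then weight by today’s emission.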
The backward algorithm and smoothing
The forward algorithm only looks in one direction: it tells us the probability of being in each state given everything observed up to that point. But what if we have the complete sequence and want to ask about a state in the middle? The forward algorithm alone can’t use future observations to refine past estimates.
That’s where the backward algorithm comes in. It sweeps right to left, computing the probability of seeing the remaining observations given each possible state at the current time. Think of it as information flowing backward from the future.
When we multiply the forward and backward probabilities together and normalize, we get the smoothed posterior: the probability of being in each state at each time step, given the entire observation sequence. This is called forward-backward smoothing, and it’s the workhorse of HMM inference.
I’ll be honest — the first time I saw this, it felt like a magic trick. Information from the past meets information from the future, and where they overlap, that’s your best guess. Another way to think about it: the forward pass is like reading a mystery novel from the beginning, and the backward pass is like reading from the end. Combining them gives you the full picture that neither direction alone can provide.
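Here’s a minimal sketch of the full forward-backward computation for our three-day sequence, with the same A, B, and π as before:

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])            # transitions
B = np.array([[0.1, 0.3, 0.6], [0.6, 0.3, 0.1]])  # emissions: Low, Med, High
pi = np.array([0.5, 0.5])
obs = [2, 1, 0]                                   # High, Medium, Low
T, K = len(obs), 2

# Forward pass: evidence from the past
alpha = np.zeros((T, K))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]

# Backward pass: evidence from the future
beta = np.ones((T, K))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])

# Smoothed posterior: multiply, then normalize each day
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)
print(gamma.round(3))   # P(state | all three observations), one row per day
```

Each row sums to 1: day 1 comes out confidently Sunny, day 3 confidently Rainy, and day 2 sits near the middle — the future low-sales day pulls it toward Rainy in a way the forward pass alone couldn’t.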
The Viterbi algorithm
The forward-backward algorithm tells us the probability of each state at each time step independently. But sometimes we want something different: the single most likely sequence of states that explains all our observations. That’s the decoding problem, and the Viterbi algorithm solves it.
The subtle but important distinction: the most likely state at each time step (from forward-backward) might not form the most likely sequence. Imagine a situation where state A is the most probable at time 2 and state B is the most probable at time 3, but the transition from A to B has zero probability. The individual best states can be globally inconsistent. Viterbi finds the globally best path.
Viterbi is dynamic programming, very similar to the forward algorithm, but with one change: instead of summing over all previous states, we take the maximum. At each time step, for each state, we remember which previous state gave us the highest probability path to get here. Then at the end, we trace back from the winning final state to recover the best path.
Let’s trace through our weather example with the same observations: High, Medium, Low.
Day 1 (High): Sunny: 0.5 × 0.6 = 0.30. Rainy: 0.5 × 0.1 = 0.05. Best predecessor: none (it’s day 1).
Day 2 (Medium): Sunny: max(0.30 × 0.7, 0.05 × 0.4) × 0.3 = max(0.21, 0.02) × 0.3 = 0.063. Best predecessor: Sunny. Rainy: max(0.30 × 0.3, 0.05 × 0.6) × 0.3 = max(0.09, 0.03) × 0.3 = 0.027. Best predecessor: Sunny.
Day 3 (Low): Sunny: max(0.063 × 0.7, 0.027 × 0.4) × 0.1 = 0.0044. Best predecessor: Sunny. Rainy: max(0.063 × 0.3, 0.027 × 0.6) × 0.6 = max(0.019, 0.016) × 0.6 = 0.011. Best predecessor: Sunny.
Trace back: Day 3 winner is Rainy (0.011 > 0.0044). Its best predecessor at day 2 was Sunny. Sunny’s best predecessor at day 1 was Sunny. So the most likely weather sequence was: Sunny → Sunny → Rainy.
That makes intuitive sense. High sales on day 1 suggest sun, medium sales on day 2 could go either way but Sunny-to-Sunny is a strong transition, and low sales on day 3 finally flip us to Rainy.
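The trace above translates directly to NumPy — the same recursion as the forward algorithm, but with max instead of sum, plus backpointers for the traceback:

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])            # transitions
B = np.array([[0.1, 0.3, 0.6], [0.6, 0.3, 0.1]])  # emissions: Low, Med, High
pi = np.array([0.5, 0.5])
obs = [2, 1, 0]                                   # High, Medium, Low
states = ['Sunny', 'Rainy']

# Forward sweep with max instead of sum, remembering best predecessors
delta = pi * B[:, obs[0]]
backpointers = []
for o in obs[1:]:
    scores = delta[:, None] * A        # scores[i, j]: best path ending i -> j
    backpointers.append(scores.argmax(axis=0))
    delta = scores.max(axis=0) * B[:, o]

# Trace back from the winning final state
path = [int(delta.argmax())]
for bp in reversed(backpointers):
    path.append(int(bp[path[-1]]))
path.reverse()

print(' -> '.join(states[s] for s in path))   # Sunny -> Sunny -> Rainy
```

Swap `max`/`argmax` for a sum and drop the backpointers, and you’re back to the forward algorithm — that’s how closely the two are related.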
This is the algorithm that powered speech recognition for decades — figuring out the most likely word sequence given a stream of audio features. It’s also behind gene finding in bioinformatics, where the hidden states are “coding region” and “non-coding region” of DNA.
Learning the model — Baum-Welch
So far we’ve assumed we know the transition and emission matrices. In practice, we rarely do. We have a pile of observation sequences and need to learn the model parameters. This is the learning problem, and it’s solved by the Baum-Welch algorithm, which is the Expectation-Maximization (EM) algorithm applied specifically to HMMs.
The idea: start with random guesses for A, B, and π. Run forward-backward to estimate which states the system was probably in at each time step (E-step). Use those estimates to update the transition and emission probabilities (M-step). Repeat until the parameters stop changing.
Each iteration is guaranteed to increase (or at least not decrease) the likelihood of the observed data. But like all EM algorithms, it can get stuck in local optima, so in practice you run it multiple times from different starting points and keep the best result.
from hmmlearn import hmm
import numpy as np
np.random.seed(42)
# Our ice cream shop: 2 hidden weather states, 3 observation levels
true_model = hmm.CategoricalHMM(n_components=2, n_iter=100)
true_model.startprob_ = np.array([0.5, 0.5])
true_model.transmat_ = np.array([[0.7, 0.3],
[0.4, 0.6]])
true_model.emissionprob_ = np.array([[0.1, 0.3, 0.6], # Sunny
[0.6, 0.3, 0.1]]) # Rainy
# Generate 500 days of observations
observations, true_states = true_model.sample(500)
# Now pretend we only see the observations.
# Learn the model from scratch.
learned = hmm.CategoricalHMM(n_components=2, n_iter=200, random_state=42)
learned.fit(observations)
print("True transition matrix:")
print(true_model.transmat_.round(2))
print("\nLearned transition matrix:")
print(learned.transmat_.round(2))
# Decode: what weather sequence best explains the sales?
predicted_states = learned.predict(observations)
# Baum-Welch can't know which learned state is "Sunny", so the
# labels may be flipped; score both labelings and keep the better one
raw_acc = np.mean(predicted_states == true_states)
accuracy = max(raw_acc, 1 - raw_acc)
print(f"\nState recovery accuracy: {accuracy:.1%}")
That last block does something remarkable. We gave the model 500 days of sales numbers — nothing about the weather — and it recovered the transition structure almost exactly. The state labels might be swapped (Baum-Welch doesn’t know which state is “Sunny”), but the structure is there.
Rest stop. If you’ve made it this far, congratulations. You now have a working mental model of Hidden Markov Models: a system with hidden states that transition over time, observable outputs that depend on those states, and three algorithms — forward (evaluation), Viterbi (decoding), and Baum-Welch (learning) — that let you do useful things with the model.
This is a legitimately useful place to stop. HMMs are still widely used for regime detection in finance, sequence labeling in NLP, and gene finding in bioinformatics. If you can describe the three HMM problems and their algorithms, you’re ahead of most practitioners.
But there’s a limitation that might be nagging at you. HMMs only handle discrete hidden states. Our weather was either Sunny or Rainy — nothing in between. What if the hidden state is a continuous quantity, like the true position of a drone or the actual temperature of an engine? We need a different tool for that. And what if even that tool is too restrictive — what if the dynamics are nonlinear, or the noise isn’t Gaussian? Then we need yet another approach.
If the discomfort of not knowing what’s underneath is nagging at you, read on.
Kalman filters — the continuous case
HMMs assume the hidden state jumps between a fixed set of categories. Kalman filters solve the same predict-observe-correct problem, but for continuous states. Instead of “Sunny or Rainy,” the hidden state is a number — or a vector of numbers — like the position and velocity of a moving object.
Let’s adapt our ice cream scenario. Instead of discrete weather states, imagine the hidden state is the actual temperature outside — a continuous number. You still can’t see it from the kitchen, but temperature directly influences how many cones you sell. Your sales count is a noisy measurement of the temperature. And temperature evolves smoothly: today’s temperature is roughly yesterday’s temperature plus a bit of random variation.
The Kalman filter makes two specific assumptions that let it solve this problem in closed form. First, the dynamics are linear: the state at time t is a linear function of the state at time t-1, plus Gaussian noise. Second, the observations are linear functions of the state, again plus Gaussian noise. Under these assumptions, the posterior over the hidden state is always a Gaussian, and the update formulas are exact.
The predict-update cycle
The Kalman filter operates in two alternating steps, and I find it helpful to think of them as breathing. Inhale: predict. Exhale: update.
Predict: Before seeing the new observation, push your current estimate forward using the dynamics model. “If temperature was 22°C yesterday and tends to drift slowly, I predict about 22°C today.” But your uncertainty grows during prediction — the world had a chance to do something unexpected. Think of it as a Gaussian bell curve getting wider.
Update: Now you observe today’s ice cream sales. They suggest 25°C. But the sales are noisy. The update step blends your prediction with the measurement, weighted by how confident you are in each. If your prediction is very certain, you mostly trust it. If the measurement is very precise, you mostly trust that. The blending factor is called the Kalman gain.
The Kalman gain, K, is a number between 0 and 1 (in the 1D case). When K is near 1, you trust the measurement more than the prediction — your model is uncertain. When K is near 0, you trust the prediction more — the measurement is noisy. It’s a trust knob that the filter automatically adjusts at every time step based on the relative uncertainties.
After the update, your uncertainty shrinks. The bell curve gets narrower. Then the whole thing repeats. Predict (uncertainty grows), update (uncertainty shrinks), predict, update. That rhythm — that breathing — is the same predict-observe-correct loop we saw in HMMs, but now in the continuous domain.
import numpy as np
class KalmanFilter1D:
def __init__(self, process_noise=0.1, measurement_noise=1.0,
initial_estimate=0.0, initial_uncertainty=1.0):
self.Q = process_noise # how much the state drifts each step
self.R = measurement_noise # how noisy the sensor is
self.x = initial_estimate # our best guess of the state
self.P = initial_uncertainty # how uncertain we are about x
def predict(self):
# State doesn't change (random walk), but uncertainty grows
self.P = self.P + self.Q
def update(self, measurement):
# Kalman gain: how much to trust the measurement
K = self.P / (self.P + self.R)
# Blend prediction with measurement
self.x = self.x + K * (measurement - self.x)
# Uncertainty shrinks after seeing data
self.P = (1 - K) * self.P
return self.x, self.P, K
# Simulate: true temperature drifts, we measure noisily
np.random.seed(42)
true_temps = 20 + np.cumsum(np.random.randn(30) * 0.5)
measurements = true_temps + np.random.randn(30) * 3.0
kf = KalmanFilter1D(process_noise=0.25, measurement_noise=9.0)
estimates = []
for z in measurements:
kf.predict()
est, unc, gain = kf.update(z)
estimates.append(est)
mse_raw = np.mean((measurements - true_temps)**2)
mse_filtered = np.mean((np.array(estimates) - true_temps)**2)
print(f"Raw measurement MSE: {mse_raw:.2f}")
print(f"Kalman filtered MSE: {mse_filtered:.2f}")
print(f"Noise reduction: {(1 - mse_filtered/mse_raw)*100:.0f}%")
The Kalman filter typically cuts the measurement error dramatically. It’s not doing anything mysterious — it’s exploiting the fact that the true state changes slowly, so each new measurement can be averaged against the accumulated history. What’s remarkable is that it does this optimally for the linear-Gaussian case. You cannot build a better estimator under those assumptions. Rudolf Kalman proved this in 1960, and it’s still one of the most widely deployed algorithms in all of engineering.
When linearity breaks — EKF and UKF
The standard Kalman filter is beautiful, but it has a serious limitation: it requires linear dynamics. Real systems are rarely linear. A rocket’s trajectory involves trigonometry. A robot arm’s kinematics involve nonlinear joint angles. Our ice cream sales might have a nonlinear relationship with temperature — above 35°C, people stay indoors and sales drop.
Two extensions address this, each with a different philosophy.
The Extended Kalman Filter (EKF) takes the brute-force approach: linearize the nonlinear dynamics at each time step using a first-order Taylor expansion. It computes the Jacobian (the matrix of partial derivatives) of the dynamics function and the observation function, and then runs the standard Kalman equations on the linearized versions. This works well when the nonlinearity is mild — a gentle curve that looks roughly straight over the range of uncertainty. It’s still the workhorse for GPS and inertial navigation. But if the nonlinearity is strong, the linearization can be a poor approximation, and the filter can diverge.
The Unscented Kalman Filter (UKF) takes a cleverer approach. Instead of linearizing the function, it picks a small set of carefully chosen sample points — called sigma points — and pushes them through the actual nonlinear function. Then it recovers a Gaussian approximation from the transformed points. No Jacobians needed. The insight is beautiful: it’s easier to approximate a Gaussian distribution than to approximate an arbitrary nonlinear function. The UKF is more accurate than the EKF for strongly nonlinear systems, and it’s only slightly more expensive to compute.
I’m still developing my intuition for exactly when EKF fails and UKF saves the day, but the rule of thumb I’ve found most useful: if you can compute Jacobians and the nonlinearity is gentle, EKF is fine. If Jacobians are painful or the nonlinearity is strong, reach for UKF. If even UKF struggles, you need particle filters.
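To make the EKF half of that rule of thumb concrete, here is a hedged 1D sketch. The saturating observation model — sales flattening out as temperature climbs — is purely an illustrative assumption, h(x) = 50·tanh((x−20)/10) + 50; the EKF linearizes it at the current estimate using its derivative:

```python
import numpy as np

def h(x):
    # Assumed nonlinear observation model: sales saturate with temperature
    return 50 * np.tanh((x - 20) / 10) + 50

def h_prime(x):
    # Its derivative -- the 1D "Jacobian" the EKF linearizes with
    return 5 * (1 - np.tanh((x - 20) / 10) ** 2)

np.random.seed(42)
true_temps = 20 + np.cumsum(np.random.randn(30) * 0.5)
sales = h(true_temps) + np.random.randn(30) * 3.0   # noisy nonlinear obs

Q, R = 0.25, 9.0      # process and measurement noise (variances)
x, P = 20.0, 1.0      # initial estimate and uncertainty
estimates = []
for z in sales:
    P = P + Q                      # predict: uncertainty grows
    H = h_prime(x)                 # linearize h at the current estimate
    K = P * H / (H * P * H + R)    # gain using the linearized model
    x = x + K * (z - h(x))         # update against the *actual* h(x)
    P = (1 - K * H) * P
    estimates.append(x)

mae = np.mean(np.abs(np.array(estimates) - true_temps))
print(f"EKF mean absolute error: {mae:.2f} degrees")
```

Make the curve sharper (shrink the 10 in the tanh) and the linearization worsens, dragging the estimates with it — that’s the EKF’s failure mode in miniature, and the point where a UKF or particle filter earns its keep.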
Particle filters — throwing darts at the posterior
Both the Kalman filter and its extensions make one crucial assumption: the posterior is Gaussian. That’s fine when it’s roughly bell-shaped. But what about posteriors with multiple peaks? What if a robot is lost in a building and there are three corridors that all look the same — the posterior over its location has three modes? A Gaussian can’t represent that.
Particle filters, also called sequential Monte Carlo methods, abandon the Gaussian assumption entirely. Instead of tracking the posterior as a mean and covariance, they represent it as a cloud of weighted samples — particles — scattered across the state space.
The analogy I find most useful: imagine you’re trying to find a cat in a dark warehouse. You can’t see anything, but you have a microphone. You scatter 1,000 imaginary copies of the cat across the warehouse — those are your particles. When you hear a meow, you check which particles are near where the sound seems to come from. Those particles get higher weights. Low-weight particles are probably in the wrong spot. Then you resample: draw a new set of 1,000 particles, but biased toward the high-weight ones. Some well-placed particles get duplicated; badly-placed ones disappear. Then each particle takes a random step (the dynamics). Then you hear another meow and repeat.
Over time, the particles converge on the cat’s true location, even if there were initially multiple plausible spots. The posterior can be multimodal, heavy-tailed, whatever. The particles don’t care.
The cost is computation. You need a lot of particles — hundreds or thousands — and the number grows badly with the dimensionality of the state. In high dimensions, most of the particles end up in low-probability regions, and you need exponentially more to cover the space. This is the curse of dimensionality applied to particle filters, and it’s the reason they’re mainly used for low-dimensional problems: robot localization (2D or 3D position), tracking a few objects, or estimating a handful of parameters.
import numpy as np
def particle_filter_1d(measurements, n_particles=1000,
process_noise=0.5, measurement_noise=3.0):
# Initialize particles uniformly around a rough guess
particles = np.random.randn(n_particles) * 5 + 20
weights = np.ones(n_particles) / n_particles
estimates = []
for z in measurements:
# Predict: each particle takes a random step
particles += np.random.randn(n_particles) * process_noise
# Update: weight each particle by how well it
# explains the measurement
likelihood = np.exp(-0.5 * ((z - particles) / measurement_noise)**2)
weights = likelihood / likelihood.sum()
# Estimate: weighted average of particles
estimates.append(np.average(particles, weights=weights))
# Resample: duplicate likely particles, drop unlikely ones
indices = np.random.choice(n_particles, size=n_particles,
replace=True, p=weights)
particles = particles[indices]
weights = np.ones(n_particles) / n_particles
return np.array(estimates)
np.random.seed(42)
true_temps = 20 + np.cumsum(np.random.randn(30) * 0.5)
noisy_measurements = true_temps + np.random.randn(30) * 3.0
pf_estimates = particle_filter_1d(noisy_measurements)
mse = np.mean((pf_estimates - true_temps)**2)
print(f"Particle filter MSE: {mse:.2f}")
That particle filter does the same job as the Kalman filter above, but without assuming anything about the shape of the posterior. For this linear-Gaussian problem, the Kalman filter will actually win — it’s exact, while the particle filter is an approximation. The particle filter’s advantage appears when the Kalman filter’s assumptions break down.
The family tree: HMMs, Kalman filters, and particle filters are all doing the same thing: recursive Bayesian estimation. Prior (predict) → likelihood (observe) → posterior (update) → becomes prior for next step. HMMs do it with discrete states and exact enumeration. Kalman does it with continuous states and Gaussian math. Particles do it with samples and no assumptions. Same loop, different representations.
Probabilistic programming languages
Everything we’ve covered so far required either hand-implementing the algorithms (tedious) or using specialized libraries for each model type (HMM library for HMMs, Kalman library for Kalman filters). What if you want a model that’s a bit of both? Or a model that doesn’t fit neatly into any of these categories?
This is where probabilistic programming languages (PPLs) change the game. A PPL lets you write your model as code — you describe the generative story of how the data was produced — and the PPL automatically figures out how to do inference. You specify the forward story: “first draw the parameters from their priors, then generate observations from the likelihood.” The PPL inverts this: given the observed data, it computes the posterior over parameters.
I think of it like this. Building an HMM from scratch is like building a custom engine for a specific car. Using a PPL is like having a factory that can build any engine from a blueprint. You hand it the blueprint (your model specification), and it handles the manufacturing (inference).
The four main PPLs in the Python ecosystem each have a different personality.
Writing a model in PyMC
PyMC (version 5+) is the most accessible entry point. It’s pure Python, built on PyTensor, and uses the NUTS sampler by default — a sophisticated variant of Hamiltonian Monte Carlo that automatically tunes itself. It has excellent documentation, a large community, and tight integration with ArviZ for diagnostics and visualization.
Let’s go back to our ice cream shop, but now with a twist. Instead of a fixed HMM, we want to estimate how temperature affects sales, accounting for uncertainty. We observe daily ice cream sales and daily (noisy) temperature readings. We believe sales depend linearly on temperature, but we don’t know the slope or intercept. We want the full posterior distribution over those parameters — not point estimates, but distributions that capture our uncertainty.
import pymc as pm
import numpy as np
import arviz as az
np.random.seed(42)
# Simulate: temperature drives sales, but both are noisy
n_days = 100
true_temp = 15 + np.random.randn(n_days) * 5
true_sales = 10 + 2.5 * true_temp + np.random.randn(n_days) * 8
with pm.Model() as ice_cream_model:
# Priors: what we believe before seeing data
intercept = pm.Normal('intercept', mu=0, sigma=20)
slope = pm.Normal('slope', mu=0, sigma=5)
noise = pm.HalfNormal('noise', sigma=10)
# Likelihood: how data is generated
expected_sales = intercept + slope * true_temp
sales = pm.Normal('sales', mu=expected_sales, sigma=noise,
observed=true_sales)
# Let PyMC figure out the posterior
trace = pm.sample(2000, tune=1000, random_seed=42, cores=2)
summary = az.summary(trace, var_names=['intercept', 'slope', 'noise'])
print(summary)
print(f"\nTrue intercept: 10, True slope: 2.5, True noise: 8")
Three things to notice. We wrote the forward story — priors, then likelihood — in about ten lines. We never mentioned MCMC, proposal distributions, or acceptance ratios. And the posterior summary gives us not a single number but a full distribution for each parameter, with credible intervals and convergence diagnostics. That’s the power of a PPL: you think about the model, and the framework thinks about the inference.
Stan, NumPyro, and Pyro
Stan is the gold standard for production Bayesian inference. It compiles your model to C++ and runs a highly optimized NUTS sampler. The diagnostics are the best in the business. The trade-off: you write models in Stan’s own language, not Python, and compilation takes seconds to minutes. Use Stan when PyMC’s sampler struggles on complex geometries, or when publishing in a statistics journal — Stan is the de facto standard there.
NumPyro is built on JAX and JIT-compiles your model to run NUTS on GPU. For large datasets or models with many parameters, NumPyro can be 10–100x faster than PyMC. The syntax is similar to PyMC but uses JAX’s functional style. Use NumPyro when you need speed or plan to integrate with JAX-based deep learning.
Pyro is built on PyTorch and designed for deep probabilistic models. It excels at stochastic variational inference and integrates naturally with PyTorch neural networks. Use Pyro when your model involves neural networks with Bayesian components, or when you need the full flexibility of PyTorch’s autograd.
Here’s the same ice cream model in NumPyro, to show the flavor difference:
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS
import jax.numpy as jnp
import jax.random as random
def ice_cream_model(temperature, observed_sales=None):
intercept = numpyro.sample('intercept', dist.Normal(0, 20))
slope = numpyro.sample('slope', dist.Normal(0, 5))
noise = numpyro.sample('noise', dist.HalfNormal(10))
expected = intercept + slope * temperature
numpyro.sample('sales', dist.Normal(expected, noise),
obs=observed_sales)
kernel = NUTS(ice_cream_model)
mcmc = MCMC(kernel, num_warmup=1000, num_samples=2000)
mcmc.run(random.PRNGKey(42), temperature=jnp.array(true_temp),
observed_sales=jnp.array(true_sales))
mcmc.print_summary()
The model specification reads almost identically. The main difference is the functional style — the model is a plain Python function, not a context manager. Under the hood, NumPyro is compiling to XLA and can leverage GPUs. For our 100-day toy example the difference doesn’t matter, but when you have millions of data points or hundreds of parameters, it matters a lot.
My advice: start with PyMC. It has the gentlest learning curve and covers 80% of use cases. Move to Stan when you hit edge cases. Move to NumPyro when you need speed. Move to Pyro when you need deep learning integration.
Posterior predictive checks
You’ve specified a model and sampled the posterior. How do you know if the model is any good? This is where posterior predictive checks come in, and I’ll be honest — this idea was a genuine revelation for me the first time I encountered it.
The idea: take the posterior samples of your parameters and use them to generate fake data from the model. Then compare the fake data to your real data. If the model is capturing the data-generating process well, the fake data should look like the real data. If there are systematic differences — the fake data is too symmetric, or doesn’t have enough extreme values, or has the wrong correlation structure — your model is missing something important.
Back to our ice cream shop. We fit the linear model above. Now we draw, say, 500 sets of simulated sales data from the posterior. Each set uses different parameter values drawn from the posterior. We overlay these simulated distributions on the actual observed sales. If the observed data lives comfortably inside the simulated cloud, the model is plausible. If the observed data is an outlier from the simulated cloud, something is wrong.
with ice_cream_model:
    ppc = pm.sample_posterior_predictive(trace, random_seed=42)

# ArviZ makes this visual comparison easy
az.plot_ppc(az.from_pymc3(trace=trace, posterior_predictive=ppc))
Posterior predictive checks are not a pass/fail test. They’re a diagnostic conversation. All models are wrong — the question is whether the model is wrong in ways that matter for your specific use case. A model that gets the mean right but misses heavy tails might be fine for estimating average sales but terrible for inventory planning.
Model comparison — WAIC and LOO-CV
Posterior predictive checks tell you if one model is reasonable. But what if you have two reasonable models and want to pick the better one? You need a way to compare their predictive performance.
The frequentist approach would be something like AIC or BIC. The Bayesian equivalents are WAIC (Widely Applicable Information Criterion) and LOO-CV (Leave-One-Out Cross-Validation). Both estimate how well the model will predict new data, not the data it was trained on.
WAIC computes the expected log pointwise predictive density — a measure of how well each observation is predicted by the model — and penalizes for complexity using the variance of the log-likelihood across posterior samples. Higher variance means the model is more sensitive to exactly which parameter values you use, which is a sign of overfitting.
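To make that definition concrete, here is a minimal NumPy sketch that computes WAIC from a matrix of pointwise log-likelihoods (rows are posterior draws, columns are observations). The synthetic log_lik matrix is made up purely for illustration; in practice a library like ArviZ does this for you.

```python
import numpy as np

def waic(log_lik):
    """WAIC from a (draws, observations) log-likelihood matrix.

    lppd: log pointwise predictive density -- average the likelihoods
    over posterior draws *before* taking the log.
    p_waic: effective number of parameters -- the variance of the
    log-likelihood across draws, summed over observations.
    """
    # log-mean-exp over draws, computed stably per observation
    max_ll = log_lik.max(axis=0)
    lppd = np.sum(max_ll + np.log(np.exp(log_lik - max_ll).mean(axis=0)))
    p_waic = np.sum(log_lik.var(axis=0, ddof=1))
    return -2 * (lppd - p_waic)   # deviance scale: lower is better

# Synthetic example: 2000 posterior draws, 100 observations
rng = np.random.default_rng(0)
log_lik = rng.normal(loc=-1.2, scale=0.1, size=(2000, 100))
print(waic(log_lik))
```

Notice how the penalty term is exactly the "variance of the log-likelihood across posterior samples" from the paragraph above: a model whose fit swings wildly between draws pays a larger complexity tax.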
LOO-CV takes a more direct approach: for each data point, it asks how well the model predicts that point when it wasn’t used for fitting. In principle this requires refitting the model N times (once for each left-out observation), which would be impossibly slow. In practice, it uses a clever importance sampling trick called Pareto Smoothed Importance Sampling (PSIS) that approximates the leave-one-out posteriors from a single fit. When the approximation is good (the Pareto k diagnostic is below 0.7), you get reliable LOO-CV estimates for free.
import arviz as az

# Compare two models: linear vs quadratic temperature effect
# (assume we've fit both and have their traces)
comparison = az.compare({
    'linear': trace_linear,
    'quadratic': trace_quadratic,
}, ic='loo')
print(comparison)
LOO-CV is generally preferred over WAIC because it’s more robust to model misspecification and has better diagnostics (the Pareto k values tell you when the approximation breaks down). The ArviZ library computes both from a single set of posterior samples, so there’s little reason not to use LOO as the default.
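To see where the importance-sampling trick comes from, here is a toy NumPy sketch of plain importance-sampling LOO, without the Pareto smoothing that PSIS layers on top. For each observation, the leave-one-out predictive density is approximated by the harmonic mean of its likelihood across posterior draws; PSIS exists precisely because this raw estimator can have unstable weights.

```python
import numpy as np

def is_loo(log_lik):
    """Plain importance-sampling LOO (no Pareto smoothing).

    For observation i: p(y_i | y_without_i) is approximated by
    1 / mean_s(1 / p(y_i | theta_s)), the harmonic mean of the
    likelihoods over posterior draws s.
    """
    neg = -log_lik
    # -log-mean-exp of the negated log-likelihoods, per observation
    max_neg = neg.max(axis=0)
    elpd_i = -(max_neg + np.log(np.exp(neg - max_neg).mean(axis=0)))
    return elpd_i.sum()   # ELPD scale: higher is better

rng = np.random.default_rng(0)
log_lik = rng.normal(loc=-1.2, scale=0.1, size=(2000, 100))
print(is_loo(log_lik))
```

Because the harmonic mean down-weights draws that fit a point unusually well, the result is always at or below the in-sample lppd; that gap is the out-of-sample penalty LOO-CV is estimating.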
One thing that tripped me up early: lower WAIC and LOO (or equivalently, higher ELPD) means better predictive performance. The sign convention is confusing because some packages report the information criterion (lower is better) and others report the expected log predictive density (higher is better). Always check which one you’re looking at.
Bayesian neural networks
A standard neural network has fixed weights after training. It makes a prediction, and that prediction is a single number (or a single probability). It doesn’t tell you how confident it is.
A Bayesian neural network (BNN) puts probability distributions over the weights instead of point values. Instead of one set of weights, you have a posterior distribution over all possible weight configurations, given the training data. When you make a prediction, you sample many sets of weights, run the input through each, and look at how much the predictions vary. High variation means high uncertainty.
Return to our ice cream shop one more time. A standard neural network predicting sales from temperature would give you a single number: “42 cones.” A Bayesian neural network would give you: “42 ± 12 cones, and by the way, at 45°C the uncertainty balloons because we’ve never seen data that hot.”
This matters enormously in high-stakes domains. A medical diagnosis model that says “90% probability of benign” is very different from one that says “90% probability of benign, but my uncertainty is huge because this looks nothing like my training data.” The first gives you a number. The second gives you a number and tells you whether to trust it.
In practice, full BNNs are expensive — maintaining and sampling from distributions over millions of weights is computationally brutal. Several approximations exist. MC Dropout is the cheapest: keep dropout active at prediction time, run the input through the network multiple times, and treat the variation as uncertainty. It’s not a true posterior, but it works surprisingly well as a practical approximation. Variational inference (via Pyro or NumPyro) is more principled: it fits a variational distribution to the posterior over weights. Deep ensembles — training multiple independent networks and looking at their disagreement — are technically not Bayesian, but they provide excellent uncertainty estimates in practice.
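To make MC Dropout concrete, here is a minimal NumPy sketch: a tiny two-layer network with made-up weights, where the dropout mask stays active at prediction time. Running the same input through many stochastic forward passes gives a spread of predictions, and that spread is the uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny fixed two-layer network; weights are random, for illustration
W1 = rng.normal(size=(1, 32))
W2 = rng.normal(size=(32, 1))

def predict_with_dropout(x, p=0.5):
    """One stochastic forward pass: dropout stays ON at prediction time."""
    h = np.maximum(x @ W1, 0.0)        # ReLU hidden layer
    mask = rng.random(h.shape) > p     # fresh random dropout mask
    h = h * mask / (1.0 - p)           # inverted-dropout scaling
    return h @ W2

x = np.array([[0.7]])
passes = np.array([predict_with_dropout(x) for _ in range(200)])
mean, std = passes.mean(), passes.std()
print(f"prediction: {mean:.2f} +/- {std:.2f}")
```

The only change from a standard network is not turning dropout off; the cost is a few hundred forward passes instead of one, which is why it's the cheapest of the approximations above.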
Uncertainty quantification
Bayesian neural networks bring us to the broader topic of uncertainty quantification, which is arguably the single most important practical contribution of Bayesian methods to machine learning.
There are two fundamentally different kinds of uncertainty, and confusing them leads to bad decisions.
Aleatoric uncertainty is noise inherent in the data. Even with a perfect model and infinite training data, you can’t predict exactly how many ice cream cones you’ll sell tomorrow. The weather has a random component. Customer behavior is stochastic. This uncertainty is irreducible — no amount of data collection will eliminate it. A well-calibrated model should report it honestly.
Epistemic uncertainty is uncertainty from lack of knowledge. If you’ve only ever observed sales on days between 10°C and 30°C, your model has no idea what happens at 40°C. This uncertainty is reducible — collect data at 40°C and it goes away. Epistemic uncertainty is what Bayesian methods capture through the posterior distribution: the spread of the posterior is widest in regions where data is sparse.
A well-calibrated uncertainty system reports both. It says: “For tomorrow at 20°C, I predict 42 ± 8 cones (mostly aleatoric noise), but for next week at 38°C, I predict 55 ± 25 cones (mostly epistemic uncertainty because I’ve never seen that temperature).”
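One common way to separate the two, sketched below with synthetic numbers, is the law of total variance applied across posterior draws: average the per-draw noise variance to get the aleatoric part, and take the variance of the per-draw predicted means to get the epistemic part. The draw values here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Posterior draws of the predictive mean and noise scale at one input
# (synthetic numbers, loosely matching the ice cream example above)
mu_draws = rng.normal(42.0, 8.0, size=5000)   # per-draw predicted mean
sigma_draws = np.full(5000, 8.0)              # per-draw noise std

# Law of total variance: Var[y] = E[sigma^2] + Var[mu]
aleatoric = np.mean(sigma_draws**2)   # irreducible data noise
epistemic = np.var(mu_draws)          # spread from parameter uncertainty
total = aleatoric + epistemic
print(np.sqrt(aleatoric), np.sqrt(epistemic), np.sqrt(total))
```

In a data-rich region the posterior over the mean tightens, `epistemic` shrinks toward zero, and the total uncertainty bottoms out at the aleatoric floor; in a sparse region `epistemic` dominates.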
In production systems, epistemic uncertainty is the more actionable one. High epistemic uncertainty means the model is extrapolating — it’s making stuff up. That’s when you should fall back to a simpler model, flag the prediction for human review, or collect more data in that region.
My favorite thing about this distinction is that, beyond high-level explanations like the one above, nobody knows how to perfectly disentangle the two in every situation; finding the best decomposition methods is still an active research question. But even an approximate decomposition is enormously more useful than no uncertainty estimate at all.
If you’re still with me, thank you. I hope it was worth it.
We started with a windowless ice cream shop, trying to guess the weather from sales numbers. That led us to Hidden Markov Models and the forward, backward, and Viterbi algorithms. We moved to continuous states with Kalman filters and the predict-update cycle. When those assumptions broke, we scattered particles across the state space. Then we stepped up to probabilistic programming languages that let us write any model we can imagine and have the computer handle the inference. And we ended with the question that matters most in production: how uncertain is the model, and should we trust it?
My hope is that the next time you encounter a problem involving hidden states, noisy observations, or uncertain predictions — instead of reaching for a black-box neural network and hoping for the best — you’ll consider whether a Bayesian approach might give you not only an answer, but also an honest assessment of how much that answer is worth.
Resources and credits
- Rabiner (1989), “A Tutorial on Hidden Markov Models” — The O.G. HMM tutorial. Dense but comprehensive. If you read one paper on HMMs, make it this one.
- Kalman (1960), “A New Approach to Linear Filtering and Prediction Problems” — The paper that launched a thousand applications. Remarkably readable for a 1960 paper.
- Doucet, de Freitas, & Gordon (2001), “Sequential Monte Carlo Methods in Practice” — The definitive reference on particle filters. More textbook than tutorial, but wildly thorough.
- McElreath, “Statistical Rethinking” (2nd ed.) — The best introduction to Bayesian modeling with PPLs. Conversational, opinionated, and unforgettable. Chapters 6–7 cover model comparison beautifully.
- PyMC documentation & examples gallery — Insightful, well-maintained, and the fastest way to go from “I want to try Bayesian modeling” to “I have a working model.”
- Gal (2016), “Uncertainty in Deep Learning” (PhD thesis) — The foundational work on MC Dropout as approximate Bayesian inference. Made BNNs practical for the rest of us.