Gaussian Processes & Bayesian Optimization
I avoided Gaussian Processes for an embarrassingly long time. Every time I saw the phrase “distribution over functions” in a paper, I’d nod politely and keep scrolling, privately convinced that only people who dream in measure theory could actually use them. Then I found myself on a project where I had exactly 47 data points, management wanted uncertainty bars on every prediction, and the neural network I’d been planning to use started to look absurd. The discomfort of not knowing what a GP actually does grew too great. Here is that dive.
Gaussian Processes emerged from geostatistics in the 1960s (where they went by the name kriging, after South African mining engineer Danie Krige) and were formalized for machine learning largely through Carl Rasmussen and Christopher Williams’ 2006 textbook. They give you regression and classification with built-in uncertainty estimates, and they power Bayesian Optimization — among the most sample-efficient methods we have for tuning expensive black-box functions.
Before we start, a heads-up. We’re going to touch on multivariate Gaussians, matrix inversions, and kernel functions. You don’t need to know any of that beforehand. We’ll add what we need, one piece at a time.
This isn’t a short journey, but I hope you’ll be glad you came.
Contents
The Weather Station Problem
What “Distribution Over Functions” Actually Means
The Kernel: Your Only Modeling Decision
GP Regression, Step by Step
GP Classification
Rest Stop
The O(n³) Problem and Sparse GPs
Bayesian Optimization: Searching Smartly
Acquisition Functions: Where to Look Next
Multi-Fidelity Bayesian Optimization
Wrap-Up
Resources
The Weather Station Problem
Imagine you’re a meteorologist. You have three weather stations scattered across a valley. Station A, near the lake, reads 18°C. Station B, up on the ridge, reads 12°C. Station C, in the town center, reads 20°C. A farmer calls and asks: “What’s the temperature at my field, which is 3 km east of Station A?”
You could fit a straight line through the three readings. Or a parabola. Or a wiggly polynomial that passes through all three points exactly. Each choice is a commitment to a specific shape, and with only three data points, that commitment feels premature. What you’d really like to say is: “Based on these three readings and the assumption that temperature changes smoothly over short distances, the field is probably around 17°C — but I’m not very sure. If you want more confidence, give me more stations.”
That’s what a Gaussian Process does. It doesn’t pick a single curve. It considers all smooth curves that pass near your data, and it tells you which ones are more plausible than others. Where data is dense, the plausible curves agree with each other and uncertainty is low. Where data is sparse, the curves fan out and uncertainty is high.
We’ll use this valley and its weather stations as our running example. It starts with three stations. By the end, it’ll have taught us everything we need.
What “Distribution Over Functions” Actually Means
The phrase “distribution over functions” sounds like something that should come with a warning label. Let me unpack it the way it finally made sense to me.
Start with a single random variable. A Gaussian distribution over it gives you a bell curve: some values are more likely, some less. Now take two random variables together — say, the temperature at Station A and Station B. A multivariate Gaussian distribution gives you a two-dimensional bell shape, and crucially, it encodes how the two temperatures correlate. If Station A is warm, Station B is probably warm too (they’re in the same valley), and the multivariate Gaussian captures that.
Now take three random variables — temperatures at A, B, and C. Same idea, but in three dimensions. The multivariate Gaussian still works; it’s specified by a 3-element mean vector and a 3×3 covariance matrix.
Here’s the leap. Imagine you could have a weather station at every point in the valley — an infinite number of locations. You’d have an infinite number of random variables (one temperature per location), and you’d want a Gaussian distribution over all of them simultaneously. That’s a Gaussian Process. It’s the infinite-dimensional extension of the multivariate Gaussian. Pick any finite collection of locations, and the temperatures at those locations follow a multivariate Gaussian distribution.
A GP is fully specified by two things. A mean function m(x) that says what temperature you’d expect at location x before seeing any data (in practice, we usually set this to zero and let the data speak). And a kernel function k(x, x′) that says how correlated the temperatures at two locations should be. The kernel is where all the interesting modeling decisions live.
f(x) ~ GP(m(x), k(x, x'))
Pick any finite set of locations x₁, x₂, ..., xₙ:
[f(x₁), f(x₂), ..., f(xₙ)] ~ Normal(μ, K)
where μᵢ = m(xᵢ) and Kᵢⱼ = k(xᵢ, xⱼ)
Think of it this way: a regular regression model says “the answer is this line.” A GP says “the answer is probably somewhere in this bundle of curves, and here’s how likely each one is.” Every curve in that bundle is a function. The GP assigns probabilities to all of them. That’s the distribution over functions.
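One way to make that bundle concrete is to actually draw from it. Here’s a minimal numpy sketch — the kernel is the RBF we’ll meet properly in the next section, and the 3 km lengthscale and 50 valley locations are just illustration values I picked:

import numpy as np

# A simple smoothness kernel: correlation decays with the distance between two positions.
# (This is the RBF kernel from the next section; the 3 km lengthscale is an assumption.)
def kernel(x1, x2, lengthscale=3.0):
    return np.exp(-(x1 - x2) ** 2 / (2 * lengthscale ** 2))

# Pick any finite set of locations along the valley...
locations = np.linspace(0, 10, 50)

# ...build the covariance matrix K from the kernel...
K = kernel(locations[:, None], locations[None, :])

# ...and the temperatures at those 50 spots follow a multivariate Gaussian.
# Each draw is one plausible temperature curve over the valley.
rng = np.random.default_rng(0)
curves = rng.multivariate_normal(np.zeros(50), K + 1e-8 * np.eye(50), size=5)
# curves has shape (5, 50): five functions sampled from the GP prior.

Plot those five rows and you’re looking at the prior bundle of curves. Everything that follows is about trimming that bundle down with data.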
I’ll be honest — the first time I read this, it felt like a clever mathematical trick that couldn’t possibly be practical. It is practical, though, and the reason it works is entirely because of the kernel.
The Kernel: Your Only Modeling Decision
In most machine learning, you choose an architecture: how many layers, how many neurons, what activation functions. In a GP, you choose a kernel (also called a covariance function). That’s it. The kernel encodes everything you believe about the function before seeing data — how smooth it is, how quickly it varies, whether it repeats.
The kernel k(x, x′) takes two input locations and returns a number that says how correlated their outputs should be. If x and x′ are close together and the kernel returns a large value, the GP “believes” that nearby points have similar function values. That means smooth functions. If the kernel drops off quickly with distance, the GP allows more wiggly behavior.
Back to our valley: if we believe temperature changes gradually over kilometers, we want a kernel where points a few hundred meters apart are highly correlated, points 10 km apart are weakly correlated, and points 50 km apart are essentially independent. That belief, encoded as a function, is the kernel.
The RBF (Squared Exponential) Kernel
The most common starting point. Its formula is k(x, x′) = σ² exp(−||x − x′||² / (2l²)). The correlation between two points decays as a Gaussian bell curve with distance. The parameter l is the lengthscale — it controls how far apart two points can be and still be correlated. Short l means wiggly functions. Long l means gentle, sweeping curves.
Functions drawn from an RBF kernel are infinitely differentiable — smoother than any real physical process. That sounds like an asset, but in practice it can be a liability. The RBF assumes the world is very smooth, and real data often isn’t. I’ve seen RBF-based GPs produce unrealistically confident predictions between data points because they refuse to wiggle.
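To put numbers on the lengthscale’s effect, here’s a tiny sketch of the RBF formula itself — the positions and lengthscales are made up:

import numpy as np

def rbf(x1, x2, variance=1.0, lengthscale=1.0):
    # k(x, x') = sigma^2 * exp(-||x - x'||^2 / (2 l^2))
    return variance * np.exp(-(x1 - x2) ** 2 / (2 * lengthscale ** 2))

# Two valley positions 2 km apart:
print(rbf(2.0, 4.0, lengthscale=0.5))  # ~0.0003: nearly independent -> wiggly functions
print(rbf(2.0, 4.0, lengthscale=3.0))  # ~0.80: strongly correlated -> gentle, sweeping curves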
The Matérn Kernel
The Matérn family fixes the over-smoothness problem by adding a parameter ν that controls how many times the function is differentiable. Matérn-3/2 (ν = 3/2) gives functions that are once differentiable — smooth, but allowed to have corners. Matérn-5/2 (ν = 5/2) gives twice-differentiable functions. As ν goes to infinity, the Matérn converges to the RBF.
In practice, Matérn-5/2 has become the default recommendation for most applications. It’s smooth enough to produce nice predictions, rough enough to not be unrealistic. If you remember one thing from this section: start with Matérn-5/2, not RBF.
The Periodic Kernel
Some functions repeat. Daily temperature cycles. Weekly sales patterns. Heart rhythms. The periodic kernel has the form k(x, x′) = σ² exp(−2 sin²(π|x − x′|/p) / l²), where p is the period. Two points exactly one period apart are treated as if they were right next to each other.
Here’s where it gets fun: kernels can be combined. Add a periodic kernel to a Matérn kernel and you get a function that has both a repeating seasonal pattern and a long-term trend. Multiply a periodic kernel by an RBF and you get a pattern that repeats but whose amplitude changes over time. Kernel composition is how GPs express complex structural beliefs — and it’s far more interpretable than stacking neural network layers.
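In scikit-learn, kernel composition is literally arithmetic on kernel objects (ExpSineSquared is its periodic kernel; the lengthscales and periods below are illustrative guesses, not tuned values):

from sklearn.gaussian_process.kernels import RBF, Matern, ExpSineSquared

# Seasonal cycle plus a long-term trend: add a periodic kernel to a Matérn.
seasonal_plus_trend = ExpSineSquared(length_scale=1.0, periodicity=365.0) + Matern(nu=2.5, length_scale=30.0)

# A repeating pattern whose amplitude drifts over time: multiply periodic by RBF.
drifting_cycle = ExpSineSquared(length_scale=1.0, periodicity=7.0) * RBF(length_scale=100.0)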
Every kernel has hyperparameters (lengthscale, variance, period). These are learned from data by maximizing the marginal likelihood — the probability of the observed data given the kernel and its hyperparameters, integrated over all possible functions. This is one of the genuinely elegant parts of the GP framework: the model automatically balances fit and complexity, with no need for a separate validation set.
But this elegance has a cost, and we’ll get to that cost soon.
GP Regression, Step by Step
Let’s bring this back to our weather valley and walk through exactly what happens when a GP makes a prediction. No hand-waving.
We have three stations. Their positions along the valley (in km) are x = [2, 5, 8] and their temperature readings are y = [18, 12, 20]. We pick a Matérn-5/2 kernel with lengthscale l = 3 and variance σ² = 1, plus some observation noise σ_n² = 0.1 to account for thermometer imprecision.
First, we build the kernel matrix K. This is a 3×3 matrix where entry Kᵢⱼ = k(xᵢ, xⱼ). Diagonal entries are k(x, x) = σ² = 1 (each station is perfectly correlated with itself). Off-diagonal entries measure how correlated each pair of stations is — stations 2 km apart get a high value, stations 6 km apart get a lower value. We add noise: K_noisy = K + σ²ₙ I.
Now the farmer calls and asks about position x* = 5.5 (a point between Stations B and C). We compute k*, a vector of correlations between x* and each training station. Station B at x = 5 is closest, so its entry in k* is largest.
The predictive mean — our best guess of the temperature — comes from a beautiful formula:
μ* = k*ᵀ (K + σ²ₙ I)⁻¹ y
Read it right to left. We take the training temperatures y, weight them by how “important” each station is (that’s what the matrix inverse does), then combine those weights with how correlated x* is with each station. Stations that are close to x* and that carry non-redundant information get the most weight. The result is a single number: our predicted temperature.
The predictive variance — our uncertainty — comes from:
σ²* = k** − k*ᵀ (K + σ²ₙ I)⁻¹ k*
Start with the prior variance k** (how uncertain we were before seeing any data), then subtract the variance reduction from observing the training data. Near a station, the reduction is large and uncertainty is small. Far from any station, the reduction is small and uncertainty is large. This is the behavior that makes GPs so appealing for decision-making: the model tells you where it knows things and where it’s guessing.
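Here’s what those two formulas look like as code — a from-scratch numpy sketch using the valley numbers above (three stations, Matérn-5/2, lengthscale 3, noise 0.1):

import numpy as np

X = np.array([2.0, 5.0, 8.0])       # station positions (km)
y = np.array([18.0, 12.0, 20.0])    # temperature readings (°C)
x_star = 5.5                        # the farmer's query point
noise = 0.1                         # observation noise, sigma_n^2

# Matérn-5/2 kernel with lengthscale l = 3 and variance sigma^2 = 1.
def matern52(a, b, lengthscale=3.0, variance=1.0):
    d = np.abs(a - b) / lengthscale
    return variance * (1 + np.sqrt(5) * d + 5 * d**2 / 3) * np.exp(-np.sqrt(5) * d)

K = matern52(X[:, None], X[None, :])      # 3x3 kernel matrix
k_star = matern52(X, x_star)              # correlations between x* and each station
k_ss = matern52(x_star, x_star)           # prior variance at x*

K_noisy = K + noise * np.eye(3)
mu_star = k_star @ np.linalg.solve(K_noisy, y)                # k*^T (K + sigma_n^2 I)^-1 y
var_star = k_ss - k_star @ np.linalg.solve(K_noisy, k_star)   # k** - k*^T (K + sigma_n^2 I)^-1 k*

print(mu_star, np.sqrt(var_star))   # predicted temperature and one standard deviation

Try moving x_star far from all three stations and watch the variance climb back toward the prior.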
The entire posterior — the updated GP after seeing data — is itself a GP. That means we can predict at any number of new locations and get a full multivariate Gaussian distribution over all of them. The predictions at nearby locations are correlated, so the uncertainty band is smooth and continuous, not a jagged mess of independent error bars.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel
# Our three weather stations
X_train = np.array([[2], [5], [8]])
y_train = np.array([18, 12, 20])
# Matérn-5/2 kernel + noise term
kernel = Matern(nu=2.5, length_scale=3.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
gp.fit(X_train, y_train)
# Predict across the valley
X_test = np.linspace(0, 10, 200).reshape(-1, 1)
y_pred, y_std = gp.predict(X_test, return_std=True)
# y_pred[i] = predicted temperature at X_test[i]
# y_std[i] = uncertainty (one standard deviation)
# Near stations: y_std is small. Far from stations: y_std grows.
That code is the whole thing. Fit three points, get predictions with uncertainty everywhere. No architecture selection, no training epochs, no learning rate. The kernel is the model.
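And if you’re wondering where the marginal-likelihood hyperparameter learning from earlier went: it happened inside gp.fit. Two extra lines expose it:

print(gp.kernel_)                          # the kernel with its fitted lengthscale, variance, and noise level
print(gp.log_marginal_likelihood_value_)   # log p(y | X, hyperparameters) at the optimum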
GP Classification
So far we’ve been predicting continuous temperatures. But what if our weather stations recorded something binary — say, whether it rained (1) or not (0) at each location? We still want to predict the probability of rain at the farmer’s field, with uncertainty.
The trouble is that a Gaussian distribution is defined over real numbers, from −∞ to +∞. Probabilities live between 0 and 1. We need a bridge.
GP classification introduces a latent function f(x) that still has a GP prior — it can take any real value. Then we push f(x) through a squashing function (a sigmoid or a probit, which is the Gaussian CDF) to get a probability: p(rain | x) = σ(f(x)). If the latent function is large and positive at location x, the probability of rain is high. If it’s large and negative, the probability is low. If it’s near zero, we’re uncertain.
There’s a catch, and it’s a real one. In GP regression, the Gaussian likelihood plays nicely with the Gaussian prior — the posterior is available in closed form (that’s the formula we derived above). In classification, the likelihood is Bernoulli, not Gaussian, and the posterior is no longer tractable. We have to approximate it.
The two main approximation strategies are the Laplace approximation (find the mode of the posterior and fit a Gaussian around it — fast but sometimes inaccurate) and Expectation Propagation (iteratively refine a Gaussian approximation to each likelihood term — slower but more accurate). Variational inference is a third option, and it’s increasingly popular because it scales better.
I’m still developing my intuition for when each approximation matters. In my experience, the Laplace approximation works fine for well-separated classes, but if the decision boundary is subtle, Expectation Propagation gives noticeably better calibrated uncertainties.
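For completeness, here’s a minimal sketch with scikit-learn’s GaussianProcessClassifier, which uses the Laplace approximation under the hood. The rain observations are made up:

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import Matern

# Hypothetical rain observations at five valley positions: 1 = rained, 0 = didn't.
X_train = np.array([[1.0], [3.0], [5.0], [7.0], [9.0]])
y_train = np.array([1, 1, 0, 0, 1])

# A latent GP squashed through a link function, posterior approximated via Laplace.
gpc = GaussianProcessClassifier(kernel=Matern(nu=2.5, length_scale=3.0))
gpc.fit(X_train, y_train)

print(gpc.predict_proba(np.array([[5.5]])))  # [P(no rain), P(rain)] at the farmer's field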
Rest Stop
Congratulations on making it this far. You can stop here if you want.
You now have a solid mental model: a GP defines a distribution over functions, controlled by a kernel that encodes smoothness assumptions. Conditioning on data gives you a posterior GP with predictions and uncertainty. For classification, we push through a sigmoid and approximate. This mental model is enough to use GPs in practice for small datasets where you need principled uncertainty.
It doesn’t tell the full story, though. We haven’t confronted the computational elephant in the room — the fact that GPs become agonizingly slow as data grows. We haven’t talked about how to use GPs as the engine for Bayesian Optimization, which is arguably where they have the most impact in modern ML practice. And we haven’t discussed how to make BO work when even single evaluations are expensive but you have access to cheap approximations.
The short version: GPs scale as O(n³), sparse approximations fix this, and Bayesian Optimization uses GPs to search for optima of expensive functions with remarkable efficiency. That’s roughly 70% of the story.
But if the discomfort of not knowing what’s underneath is nagging at you, read on.
The O(n³) Problem and Sparse GPs
Here’s the part where the elegance of GPs collides with reality. That beautiful prediction formula has (K + σ²ₙ I)⁻¹ in it — the inverse of an n×n matrix, where n is the number of training points. Matrix inversion is O(n³) in time and O(n²) in memory. For our three weather stations, this is trivial. For 1,000 stations, it’s still quick. For 10,000 stations, it takes a couple of minutes. For 100,000, it’s impractical on most hardware.
To make this concrete: if a CPU can do about 10 billion floating-point operations per second, inverting a 10,000×10,000 matrix takes around 100 seconds. A 100,000×100,000 matrix? About 100,000 seconds — more than a day. The cubic scaling is brutal.
The solution that has gained the most traction is sparse GPs with inducing points. The idea: instead of using all n training points to define the GP posterior, select a small set of m points (called inducing points or pseudo-inputs, where m is much smaller than n) and approximate the full GP using those points. Think of it as placing m “summary stations” in the valley that capture the essential behavior of all n stations.
The inducing points don’t have to be actual data points. Their locations and associated function values are learned as part of the optimization, typically by maximizing a variational lower bound on the marginal likelihood (an approach formalized by Titsias in 2009). This reduces the computational cost from O(n³) to O(nm²) — a massive improvement when m is, say, 500 and n is 100,000.
A second approach is Random Fourier Features. The insight is that certain kernels (RBF, Matérn) can be approximated by random trigonometric features. Once you compute these features, the GP reduces to Bayesian linear regression, which is O(nd²) where d is the number of features. This trades kernel fidelity for speed.
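A bare-bones numpy sketch of the idea, with made-up one-dimensional data (the feature count D and the lengthscale are arbitrary choices here):

import numpy as np

rng = np.random.default_rng(0)
n, D, lengthscale = 10_000, 200, 3.0      # n data points, D random features

X = rng.uniform(0, 10, size=(n, 1))       # made-up 1-D inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Random Fourier features approximating an RBF kernel with this lengthscale.
W = rng.normal(0, 1.0 / lengthscale, size=(1, D))
b = rng.uniform(0, 2 * np.pi, size=D)
phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)            # (n, D) feature matrix

# Bayesian linear regression on the features: O(n D^2) instead of O(n^3).
noise_var, prior_var = 0.01, 1.0
A = phi.T @ phi / noise_var + np.eye(D) / prior_var
w_mean = np.linalg.solve(A, phi.T @ y / noise_var)    # posterior mean weights

x_test = np.array([[5.5]])
phi_test = np.sqrt(2.0 / D) * np.cos(x_test @ W + b)
print(phi_test @ w_mean)                              # approximate GP predictive mean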
In practice, GPyTorch (built on PyTorch) and GPflow (built on TensorFlow) are the libraries that make sparse GPs and GPU-accelerated inference accessible. GPyTorch in particular has pushed GP inference to datasets with hundreds of thousands of points, using a combination of inducing points, conjugate gradients, and GPU parallelism. It’s not magic — the approximation does lose some fidelity, especially in the uncertainty estimates — but for most practical purposes, it works remarkably well.
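To give a flavor of what that looks like, here’s a minimal sparse-GP sketch in GPyTorch, following its standard variational (SVGP) pattern. The data is synthetic, m = 100 inducing points is an arbitrary choice, and I’ve skipped minibatching for brevity:

import torch
import gpytorch

class SparseGP(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        var_dist = gpytorch.variational.CholeskyVariationalDistribution(inducing_points.size(0))
        strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, var_dist, learn_inducing_locations=True)
        super().__init__(strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.MaternKernel(nu=2.5))

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(self.mean_module(x), self.covar_module(x))

# Synthetic data: 20,000 points summarized by 100 learned inducing points.
X = torch.linspace(0, 10, 20_000).unsqueeze(-1)
y = torch.sin(X.squeeze()) + 0.1 * torch.randn(X.size(0))

model = SparseGP(X[torch.randperm(X.size(0))[:100]])
likelihood = gpytorch.likelihoods.GaussianLikelihood()
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=y.size(0))
opt = torch.optim.Adam(list(model.parameters()) + list(likelihood.parameters()), lr=0.01)

model.train(); likelihood.train()
for _ in range(200):
    opt.zero_grad()
    loss = -mll(model(X), y)    # variational lower bound (Titsias-style ELBO)
    loss.backward()
    opt.step()

model.eval(); likelihood.eval()
with torch.no_grad():
    pred = likelihood(model(torch.tensor([[5.5]])))
    print(pred.mean, pred.variance)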
I’ll be honest: the first time I saw a sparse GP fit 50,000 points in under a minute and produce sensible uncertainty bands, I was surprised. The approximation loss was far smaller than I expected. But there are edge cases — multi-modal functions, data with sharp transitions — where the inducing point approximation can miss structure. It’s good engineering, not a free lunch.
Bayesian Optimization: Searching Smartly
Here’s where GPs earn their keep in modern ML practice. Not as a standalone regression model (neural networks have largely eaten that market), but as the engine inside Bayesian Optimization.
The problem: you have a function that’s expensive to evaluate. Training a deep neural network with a specific set of hyperparameters takes 4 hours. Running a physics simulation takes 6 hours. Synthesizing a candidate molecule in a wet lab takes a week. You want to find the input that maximizes (or minimizes) this function, and you can’t afford 10,000 evaluations. Maybe you can afford 50.
Grid search divides the space into a regular grid and evaluates every point. For 5 hyperparameters with 10 values each, that’s 100,000 evaluations — at 4 hours each, roughly 45 years. Random search is better (it doesn’t waste evaluations on unimportant dimensions), but it doesn’t learn from its past evaluations. Each new trial is as uninformed as the first.
Bayesian Optimization learns. It maintains a surrogate model — a cheap-to-evaluate approximation of the expensive function — and uses it to decide where to evaluate next. The surrogate is typically a GP, because GPs provide exactly what we need: a prediction of the function value at any point, plus an uncertainty estimate that tells us how much we might learn by evaluating there.
The loop looks like this. We start with a few initial evaluations (often random). We fit a GP to those evaluations. We then ask: “given what I know so far, where should I evaluate next to make the most progress?” That question is answered by an acquisition function, which we maximize over the cheap GP surrogate instead of the expensive real function. We evaluate the real function at the point the acquisition function recommends, add the result to our data, update the GP, and repeat.
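Here’s that loop as a minimal sketch. The objective is a stand-in toy function rather than anything expensive, and the acquisition function is a simple upper confidence bound (more on acquisition choices in the next section):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expensive_function(x):
    # Stand-in for the real 4-hour evaluation (hypothetical toy objective).
    return -(x - 7.3) ** 2 + 20.0

bounds = (0.0, 10.0)
rng = np.random.default_rng(0)

# A few random evaluations to start.
X_obs = list(rng.uniform(*bounds, size=3))
y_obs = [expensive_function(x) for x in X_obs]

for _ in range(10):
    # Fit the cheap GP surrogate to everything seen so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.array(X_obs).reshape(-1, 1), np.array(y_obs))

    # Score a grid of candidates with a simple acquisition function
    # (upper confidence bound here; alternatives are discussed below).
    candidates = np.linspace(*bounds, 500).reshape(-1, 1)
    mu, sigma = gp.predict(candidates, return_std=True)
    acquisition = mu + 2.0 * sigma

    # One expensive evaluation per iteration, at the recommended point.
    x_next = float(candidates[np.argmax(acquisition), 0])
    X_obs.append(x_next)
    y_obs.append(expensive_function(x_next))

best = X_obs[int(np.argmax(y_obs))]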
Back to our valley analogy: imagine you’re drilling for water, and each bore hole costs $10,000. After three holes (our weather stations), the GP tells you both where water is likely (high predicted value) and where you have the most uncertainty (high variance). The acquisition function combines these signals to pick the fourth drilling location. After 20 strategic holes, you’ve found the water table’s peak — something that random drilling might have needed 200 holes to achieve.
Acquisition Functions: Where to Look Next
The acquisition function is the decision-making engine of Bayesian Optimization. It takes the GP’s posterior (mean prediction μ(x) and uncertainty σ(x) at every point) and produces a single score for each candidate location. The location with the highest score is where we evaluate next.
Every acquisition function faces the same tension: exploitation (evaluate where the GP predicts good values) versus exploration (evaluate where the GP is uncertain, because we might discover something better). Lean too far toward exploitation and you get stuck in a local optimum. Lean too far toward exploration and you waste evaluations on irrelevant regions.
Expected Improvement (EI)
The most popular acquisition function, and a good default. EI asks: “If I evaluate at point x, how much improvement over my current best result do I expect?” The expectation is taken over the GP’s posterior distribution at x.
Consider our drilling analogy, where we’re maximizing how much water a hole yields. Say our best hole so far yields 15 litres per minute. At a new candidate location, the GP predicts a mean of 13 with a standard deviation of 5. There’s a reasonable probability the true yield exceeds 15 — the bell curve extends well past it. EI computes the average of all the ways this point could beat 15, weighted by their probabilities. If the mean is below 15 but the uncertainty is large, there’s still significant expected improvement — we might get lucky. That’s how EI naturally balances exploitation and exploration.
EI has a closed-form solution (involving the Gaussian PDF and CDF), which makes it fast to compute. In practice, I’ve found it works well in 80% of tuning problems without any fiddling.
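That closed form is only a few lines — a sketch, using the drilling numbers from above:

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far):
    # Closed-form EI for maximization: (mu - best) * Phi(z) + sigma * phi(z).
    sigma = np.maximum(sigma, 1e-12)          # avoid dividing by zero
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)

# The drilling example: best hole so far scored 15; the GP predicts 13 +/- 5 here.
print(expected_improvement(mu=13.0, sigma=5.0, best_so_far=15.0))  # ~1.15: still worth considering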
Upper Confidence Bound (UCB)
UCB takes a more direct approach: UCB(x) = μ(x) + κσ(x). The predicted mean plus a multiple of the uncertainty. The parameter κ is a knob you can turn. Set κ high and the acquisition function strongly favors uncertain regions (exploration). Set κ low and it favors regions where the mean is already high (exploitation).
I like UCB because the exploration-exploitation tradeoff is explicit and tunable. If you’re early in the search and want to explore broadly, crank κ up to 2 or 3. If you’re running out of budget and want to refine around the best region, drop it to 0.5. EI doesn’t give you that dial.
Probability of Improvement (PI)
PI answers the question: “What is the probability that this point beats my current best?” It’s the simplest of the three, but it has a well-known flaw: it’s too greedy. PI treats a point that beats the current best by 0.001 the same as a point that beats it by 10. It chases high-probability, low-magnitude improvements and tends to get stuck near the current best. In my experience, it’s rarely the right choice for problems where the budget matters.
For most hyperparameter tuning: start with EI. If you suspect multiple good regions in the search space, switch to UCB with a moderate κ. PI exists for theoretical completeness, but I’d reach for it last.
BO for Hyperparameter Tuning: A Worked Example
Let’s see this in practice. We’ll tune an XGBoost classifier with Bayesian Optimization using Optuna. Optuna’s default surrogate is a Tree-structured Parzen Estimator (TPE), not a GP — TPE handles mixed continuous/categorical spaces better than GPs do, which is why it’s the practical default. The core idea is the same though: a surrogate model guides the search.
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
    }
    clf = XGBClassifier(**params, eval_metric='logloss', random_state=42)
    return cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
Each of those 50 trials is informed by all previous trials. Optuna builds a surrogate, decides which hyperparameter combination is most promising, evaluates it, and updates. Fifty intelligent trials often outperform 500 random ones. That’s the power of Bayesian Optimization — not faster individual evaluations, but far fewer wasted evaluations.
For GP-based BO specifically (say, when your search space is purely continuous), BoTorch (Meta’s library built on GPyTorch) gives you full control over the GP surrogate and acquisition function. It’s more complex to set up, but it’s the tool of choice for research and custom optimization problems.
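For a taste, here’s a minimal single-step sketch of GP-based BO in BoTorch — the initial objective values are made up, and in a real loop you’d evaluate the suggested candidate and refit:

import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

# Made-up initial evaluations of a 2-D continuous objective, inputs scaled to [0, 1]^2.
train_X = torch.rand(8, 2, dtype=torch.double)
train_Y = -((train_X - 0.5) ** 2).sum(dim=-1, keepdim=True)   # hypothetical objective values

gp = SingleTaskGP(train_X, train_Y)
fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))

ei = ExpectedImprovement(model=gp, best_f=train_Y.max())
bounds = torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.double)
candidate, _ = optimize_acqf(ei, bounds=bounds, q=1, num_restarts=10, raw_samples=100)
# candidate is the next point BO recommends evaluating with the real objective.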
Multi-Fidelity Bayesian Optimization
There’s one more problem worth confronting. Even with Bayesian Optimization choosing evaluation points intelligently, each evaluation might still take hours. Training a neural network for 100 epochs takes a long time. But training it for 5 epochs? That’s fast, and it gives you a rough estimate of how well those hyperparameters will perform.
Multi-fidelity Bayesian Optimization exploits this idea: instead of always evaluating the full expensive function, use cheap low-fidelity approximations (fewer epochs, less data, lower resolution) to weed out bad candidates quickly, and reserve expensive high-fidelity evaluations for the most promising ones.
The most influential algorithm in this space is Hyperband, which runs multiple rounds of Successive Halving. Start with many hyperparameter configurations, each given a tiny resource budget (say, 1 epoch). Evaluate all of them. Throw away the worst half. Double the budget for the survivors. Repeat until one remains. Hyperband runs several of these races with different starting configurations and budget allocations, automatically balancing between evaluating many configurations cheaply and few configurations expensively.
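The core Successive Halving step is simple enough to sketch in a few lines. Here evaluate is a hypothetical stand-in for a short training run; the real Hyperband wraps several of these races with different budget tradeoffs:

import random

def successive_halving(configs, evaluate, min_budget=1, eta=2):
    # Evaluate every config cheaply, keep the best 1/eta, multiply the budget, repeat.
    budget = min_budget
    while len(configs) > 1:
        scored = sorted(configs, key=lambda cfg: evaluate(cfg, budget), reverse=True)
        configs = scored[: max(1, len(scored) // eta)]
        budget *= eta
    return configs[0]

# Hypothetical: a "config" is just a learning rate; evaluate() stands in for
# training for `budget` epochs and returning a validation score.
def evaluate(lr, budget):
    return lr * min(budget, 16) + random.random() * 0.1

best_lr = successive_halving([random.uniform(1e-4, 1e-1) for _ in range(32)], evaluate)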
BOHB (Bayesian Optimization + HyperBand) combines Hyperband’s resource allocation with a TPE surrogate for choosing which configurations to try. Instead of picking configurations randomly (as vanilla Hyperband does), BOHB uses the surrogate model to propose configurations that are more likely to be good. It gets the sample efficiency of Bayesian Optimization and the resource efficiency of early stopping.
This matters because it changes the economics of hyperparameter tuning. Without multi-fidelity, a 50-trial BO search where each trial takes 4 hours costs 200 hours. With BOHB, most configurations are killed after a few minutes, and only the top candidates get the full 4-hour treatment. The total cost might drop to 20–30 hours for the same quality result.
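If you want this behavior without leaving Optuna, its HyperbandPruner combined with the default TPE sampler gives you BOHB-style tuning. A minimal sketch — train_one_epoch is a hypothetical stand-in for a real training step:

import optuna
import random

def train_one_epoch(lr, epoch):
    # Stand-in for real training: return a noisy "validation score" (hypothetical).
    return lr * 10 * (1 - 1 / (epoch + 2)) + random.gauss(0, 0.01)

def objective(trial):
    lr = trial.suggest_float('learning_rate', 1e-4, 1e-1, log=True)
    score = 0.0
    for epoch in range(100):
        score = train_one_epoch(lr, epoch)
        trial.report(score, step=epoch)       # report the cheap, low-fidelity result
        if trial.should_prune():              # Hyperband kills unpromising trials early
            raise optuna.TrialPruned()
    return score

study = optuna.create_study(
    direction='maximize',
    pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource=100, reduction_factor=3),
)
study.optimize(objective, n_trials=50)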
My favorite thing about multi-fidelity BO is that it matches how humans actually tune models. You train for a few epochs, glance at the loss curve, mutter “that’s terrible,” and kill it. BOHB automates that instinct with a principled framework underneath.
Wrap-Up
If you’re still with me, thank you. I hope it was worth it.
We started with three weather stations in a valley and the question “what’s the temperature at the farmer’s field?” To answer it, we built up from a single Gaussian to a multivariate Gaussian to a Gaussian Process — an infinite-dimensional distribution over functions. We learned that the kernel is the entire model, walked through the posterior prediction formula, saw how classification requires approximate inference, confronted the O(n³) scaling wall and how sparse GPs break through it, then used everything we built to power Bayesian Optimization: the art of finding optima with as few expensive evaluations as possible. We ended with multi-fidelity methods that make even BO more efficient by using cheap approximations to kill bad candidates early.
My hope is that the next time you see a Gaussian Process in a paper or hear someone mention Bayesian Optimization for hyperparameter tuning, instead of nodding politely and scrolling past, you’ll have a pretty good mental model of what’s going on under the hood — a bundle of curves, a kernel that shapes them, and an acquisition function that points toward where to look next.
Resources
- Rasmussen & Williams, “Gaussian Processes for Machine Learning” (2006) — The O.G. textbook. Available free online. Dense but comprehensive. Chapter 2 alone is worth the read.
- Distill.pub, “A Visual Exploration of Gaussian Processes” — Wildly helpful interactive article. If you want to see what kernels do to the function distribution, this is the one.
- BoTorch documentation (botorch.org) — Meta’s library for GP-based Bayesian Optimization. Excellent tutorials for custom acquisition functions.
- Optuna documentation (optuna.org) — The practical default for hyperparameter tuning. Their “10 minutes to Optuna” tutorial is accurate about the time commitment.
- Falkner et al., “BOHB: Robust and Efficient Hyperparameter Optimization at Scale” (2018) — The paper that married Bayesian Optimization with Hyperband. Insightful and well-written.
- GPyTorch (gpytorch.ai) — If you need GPs at scale, this is the library. GPU-accelerated, supports inducing points, and integrates with BoTorch.