Mathematical Notation

TL;DR

Mathematical notation is a compressed language — every symbol maps to something you already understand in code. About a dozen notational families cover 95% of ML papers: Greek letters for parameters, subscripts for indexing, Σ/Π for loops, ∇ for gradients, ‖·‖ for norms, and p(x|y) for conditional probabilities. We'll decode them all by reading the self-attention formula symbol by symbol. By the end, Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V will read like Python.

I skipped every ML paper with more than two Greek letters for years. I'd open a PDF, see a wall of sigmas and thetas and subscripted subscripts, and close the tab. I'd tell myself I'd "come back to it later." Later never came. Eventually the discomfort of nodding along in meetings while having no idea what the actual math said grew too great. Here is that dive.

Mathematical notation in ML papers is, at its core, a compression scheme. Researchers developed it over centuries to express complex ideas compactly — the same way we compress code into functions and abstractions. The problem is that nobody hands you the decompression algorithm. You're expected to have absorbed it through years of math courses that many practicing engineers either skipped or forgot. The notation itself isn't hard. The lack of a Rosetta Stone is what makes it feel hard.

Before we start, a heads-up. We're going to work through a lot of symbols — Greek letters, special fonts, operators, decorators. You don't need to know any of them beforehand. We'll add each one as we need it, with explanation, and by the end we'll read a real formula from the transformer paper together.

This isn't a short journey, but I hope you'll be glad you came.

The Running Example: Decoding the Attention Formula

Here's the formula we're building toward. It's from the 2017 paper "Attention Is All You Need," the paper that launched the transformer revolution:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Right now, that might look like alphabet soup. Capital letters that might be matrices or might be sets. A superscript T that could be an exponent or something else entirely. A subscript k on a d. A square root in a weird place. A function called softmax wrapping part of the expression but not all of it.

By the time we're done, you'll be able to point at every symbol in that formula and say exactly what it means, what shape it has, and why it's there. That's the destination. Let's start walking.

The Alphabet: Scalars, Vectors, Matrices, and Tensors

The first thing any ML paper does — often silently — is use the visual appearance of a symbol to signal what kind of object it represents. This is the single most important convention to internalize, because it tells you the shape of everything before you read a single equation.

A lowercase italic letter like x or n usually means a scalar — a single number. The learning rate. A count. An index. One lonely floating-point value.

A bold lowercase letter like x or w means a vector — a one-dimensional array. A row of your dataset. The weights of a single neuron. In code, that's np.array([1.0, 2.0, 3.0]).

A bold uppercase letter like X or W means a matrix — a two-dimensional array. Your entire dataset might be X, where each row is a sample and each column is a feature. A weight matrix connecting two layers of a neural network is W.

And a calligraphic or special uppercase letter like 𝒳 (curly X) usually means a tensor — anything with three or more dimensions — or sometimes a set or a space. Context tells you which.

I'll be honest: not every paper follows these conventions religiously. Some papers use plain unbolded letters for everything and expect you to figure it out from context. That's frustrating, but knowing the convention means you'll catch it when authors do follow it — which is most of the time.

Back to our attention formula: Q, K, and V are uppercase. They're matrices. Each one is a two-dimensional array where rows represent tokens and columns represent features. That's the first piece of the puzzle decoded.

Greek Letters: The Ten That Cover 90% of ML

Greek letters look exotic, but they're used for the same reason we use i for loop counters — convention. Here are the ones you'll see in the overwhelming majority of ML papers, mapped to what they typically mean:

θ (theta) is the workhorse. It means "parameters" — all the learnable weights in your model, bundled together. When a paper writes f_θ(x), it means "the function f with parameters θ, applied to input x." That's your neural network. θ is what gradient descent adjusts.

α (alpha) almost always means the learning rate. The step size in gradient descent. When you see θ ← θ - α∇L, that's saying: update the parameters by subtracting the learning rate times the gradient of the loss. That's one line of SGD.

λ (lambda) is the regularization strength. When a paper adds λ‖w‖² to a loss function, that's L2 regularization — penalizing large weights. λ controls how much you care about keeping weights small versus fitting the training data. Also shows up as eigenvalues in linear algebra contexts.

σ (sigma) pulls triple duty. Lowercase σ is either the standard deviation or the sigmoid function σ(x) = 1/(1 + e⁻ˣ). Uppercase Σ is summation (we'll get there). And in covariance matrices, Σ (bold uppercase sigma) is the covariance matrix itself. Three meanings from one letter. Context is everything.

μ (mu) is the mean. In a Gaussian distribution 𝒩(μ, σ²), μ is the center and σ² is the spread. In batch normalization, you compute μ and σ per mini-batch. Straightforward.

∇ (nabla) is the gradient operator (not technically a Greek letter, but it always travels with them). When you see ∇_θ L, it means "the vector of partial derivatives of the loss L with respect to each parameter in θ." It points in the direction of steepest increase. Gradient descent goes the opposite way. We'll unpack this more when we hit calculus notation.

η (eta) is sometimes used for the learning rate instead of α. Papers vary. Same meaning, different letter. Welcome to notation.

ε (epsilon) means "a very small number." It appears in several distinct contexts: as the numerical stability constant in the Adam optimizer's denominator (1e-8), as the exploration parameter in ε-greedy reinforcement learning, and anywhere else you need to avoid division by zero.

β (beta) appears in Adam optimizer as the momentum coefficients (β₁ and β₂), in regression as coefficients, and in β-VAE as the weight on the KL divergence term. Another overloaded letter.

γ (gamma) shows up as the discount factor in reinforcement learning, as the scale parameter in batch normalization, and in a few other specialized contexts.
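Here's how those letters typically land in code. A minimal sketch; the variable names are mine, chosen for illustration:

import numpy as np

def sigmoid(x):                        # lowercase σ as a function
    return 1 / (1 + np.exp(-x))

theta = np.zeros(3)                    # θ: the model's parameters
alpha = 0.01                           # α: the learning rate
lam = 0.1                              # λ: regularization strength
eps = 1e-8                             # ε: numerical stability constant

sigmoid(np.array([0.0, 2.0]))          # σ(x) = 1/(1+e⁻ˣ), elementwise
grad = np.array([0.5, -0.2, 0.1])      # a stand-in for ∇L
theta = theta - alpha * grad           # θ ← θ - α∇L: one SGD step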

I still have to look up whether ξ is "xi" or "psi" every single time. Don't worry if you can't pronounce them all. You need to recognize them on sight, not recite them from memory.

Subscripts and Superscripts: The Triple Overload

This is where notation goes from "slightly unfamiliar" to "actively hostile." Subscripts and superscripts carry at least three completely different meanings depending on context, and papers rarely tell you which one they're using.

Subscripts for indexing. xᵢ means "the i-th element of vector x." Wᵢⱼ means "the element in row i, column j of matrix W." This is the most common meaning — it maps directly to x[i] and W[i][j] in code.

Superscripts for data samples. Here's where it gets confusing. Many ML papers use x⁽ⁱ⁾ — with the superscript in parentheses — to mean "the i-th training example." So x⁽³⁾ is the third sample in your dataset, and x⁽³⁾₂ means "the second feature of the third training example." The parentheses are crucial: they tell you this is an index, not an exponent.

Superscripts for actual exponents. x² means x squared. σ² means variance (sigma squared). No parentheses. Same position, completely different meaning.

Superscripts for transpose and inverse. Xᵀ means the transpose of matrix X (swap rows and columns). W⁻¹ means the inverse of matrix W. Neither one is an exponent, despite living in the same spot.
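Here are all four meanings side by side in NumPy. A sketch with made-up values (and remember that Python's zero-based indexing shifts everything down by one):

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])         # 3 samples, 2 features
W = np.array([[2.0, 1.0],
              [0.0, 2.0]])

x_i = X[0, 1]                      # Wᵢⱼ-style indexing: one element
x_sample_3 = X[2]                  # x⁽³⁾: the third training example (a row)
x_3_2 = X[2, 1]                    # x⁽³⁾₂: second feature of the third example
x_squared = x_i ** 2               # a genuine exponent: x²
X_T = X.T                          # Xᵀ: transpose, shape (2, 3)
W_inv = np.linalg.inv(W)           # W⁻¹: matrix inverse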

In our attention formula, Kᵀ is K transposed — we flip the keys matrix so that the matrix multiplication QKᵀ produces a score matrix. And dₖ uses a subscript — it's the dimensionality of the key vectors. The k isn't an index. It's a label. Another overload.

No one tells you upfront that the same typographic position means four different things. This is, I think, the single biggest source of confusion for engineers reading ML papers. The only remedy is context: look at what the symbol is attached to, check the paper's notation section (if it has one), and ask yourself which interpretation makes the shapes work out.

Decorators: What Sits on Top of Variables

Beyond subscripts and superscripts, symbols get little decorations that change their meaning. Think of them as adjectives modifying a noun.

A hat (ŷ) means "estimated" or "predicted." When a paper writes ŷ = f_θ(x), the hat tells you this is the model's prediction, not the true value. The true value is plain y. The loss function measures the distance between y and ŷ. Once you see this, loss functions become obvious: they're all variations on "how far is the prediction from the truth?"

A bar (x̄) means "average" or "mean." x̄ = (1/n)Σxᵢ. That's it. When you see a bar on top of something, think np.mean().

A tilde (x̃) means "modified," "noisy," or "approximate." In denoising autoencoders, the input is corrupted with noise and written as x̃. In some papers, it marks a sample from a distribution. It's a flag that says "this isn't the original — something happened to it."

A star (θ*) means "optimal." θ* = argmin L(θ) means "the parameters that minimize the loss." When you see a star, the author is talking about the best possible value, the one you're searching for during training.

A dot (ẋ) means "time derivative." You'll see this mostly in physics-informed neural networks and differential equations, less in standard ML. But it's good to recognize so it doesn't throw you off.
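In code, the decorators collapse into naming conventions. A minimal sketch:

import numpy as np

y = np.array([3.0, 5.0, 2.5])             # y: the true values
y_hat = np.array([2.8, 5.2, 2.3])         # ŷ: the model's predictions
y_bar = np.mean(y)                         # ȳ: the average
y_noisy = y + 0.1 * np.random.randn(3)     # ỹ: a corrupted copy

thetas = np.array([0.9, 0.3, 0.15, 0.5])   # candidate parameter values
losses = (thetas - 0.2) ** 2               # a toy loss
theta_star = thetas[np.argmin(losses)]     # θ*: the value that minimizes it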

Summation and Product Notation

The sigma notation Σ is a for-loop that adds. The product notation Π is a for-loop that multiplies. Everything below the symbol is the loop variable and its start value. Everything above is where it stops. Everything to the right is the body of the loop.

Let's trace through a concrete example. Mean Squared Error:

MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ - ŷᵢ)²

Reading that left to right: divide by n (for the mean), then starting at i=1 and going to i=n, compute the squared difference between the true value yᵢ and the predicted value ŷᵢ, and add them up. In Python:

import numpy as np

y = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.8, 5.2, 2.3, 6.8])

# The entire formula in one line
mse = np.mean((y - y_hat) ** 2)  # 0.04

Now products. The likelihood function for independent data points is:

P(data) = Πᵢ₌₁ⁿ P(xᵢ)

Multiply the probability of each data point together. The problem? Multiply 1000 numbers smaller than 1 and you get a number so close to zero that 64-bit floating-point can't represent it. The computer literally rounds it to zero. This is numerical underflow, and it's why we take logarithms — ln(a · b) = ln(a) + ln(b) turns that product into a sum of manageable negative numbers. Every loss function in ML that involves probabilities uses this trick. Every single one.
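You can watch the underflow happen, and watch the logarithm rescue it:

import numpy as np

probs = np.full(1000, 0.1)     # 1000 independent events, each with p = 0.1

np.prod(probs)                 # 0.0 (underflow: the true value, 1e-1000, is unrepresentable)
np.sum(np.log(probs))          # -2302.58..., the log-likelihood, perfectly representable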

A few bookkeeping rules show up in derivations repeatedly. You can pull constants out of sums: Σ(c·xᵢ) = c·Σxᵢ. You can split sums: Σ(xᵢ + yᵢ) = Σxᵢ + Σyᵢ. These aren't clever insights — they're the mathematical equivalent of refactoring a loop. Recognizing them prevents you from thinking each derivation step is a magic trick when it's actually routine bookkeeping.

Rest Stop

If you've made it this far, congratulations. You can now read the basic building blocks of most ML formulas: you know what the Greek letters mean, how subscripts and superscripts work, what the decorators signal, and how Σ and Π map to loops. That's genuinely useful — it covers the notation in most blog posts, tutorials, and the simpler sections of papers. If you want to stop here, the short version of everything that follows is: ∇ means gradient, ‖·‖ means size, p(x|y) means conditional probability, and argmin means "find the input that minimizes." There. You're 70% of the way there. But if the discomfort of not knowing the rest is nagging at you, read on.

Functions, Mappings, and Optimization Notation

A function in math notation is written as f: A → B, which reads "f maps elements of set A to elements of set B." The set A is called the domain (what goes in), B is the codomain (what could come out), and the actual outputs form the range (what does come out). Every ML model is a function. A classifier f: ℝⁿ → {0, 1, ..., k} takes an n-dimensional feature vector and outputs a class label. Training is the search for the best function within the space your architecture can represent.

Composition is chaining functions: (g ∘ f)(x) = g(f(x)). Apply f first, then g. A neural network is literally a composition: f₃(f₂(f₁(x))), where each fₖ is one layer. The "deep" in deep learning means "lots of compositions." Backpropagation uses the chain rule to differentiate through these compositions — which is why composition matters before you touch calculus.
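A two-layer sketch of composition in code, with hypothetical shapes:

import numpy as np

W1 = np.random.randn(4, 3)                 # layer 1 weights (hypothetical shapes)
W2 = np.random.randn(2, 4)                 # layer 2 weights

f1 = lambda x: np.maximum(0, W1 @ x)       # f₁: linear map + ReLU
f2 = lambda x: W2 @ x                      # f₂: linear map

x = np.random.randn(3)
out = f2(f1(x))                            # (f₂ ∘ f₁)(x): apply f₁ first, then f₂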

Now the optimization operators. argmin and argmax are among the most important symbols in ML, and they trip people up because they return the input, not the output.

θ* = argmin_θ L(θ)

This says: find the value of θ that makes L(θ) as small as possible, and call that value θ*. The min operator would give you the smallest value of L itself. The argmin gives you the θ that achieves it. In Python terms, min returns the value, argmin returns the index. When a paper writes "we minimize the cross-entropy loss," it means we're looking for the argmin — the parameters that produce the minimum loss.

import numpy as np

losses = np.array([0.9, 0.3, 0.7, 0.1, 0.5])

np.min(losses)      # 0.1  — the smallest loss value
np.argmin(losses)   # 3    — the INDEX that achieves it

In the context of ML training, argmin_θ L(θ) is what the optimizer is doing every step — nudging θ toward the value that minimizes the loss. The entire field of optimization is about doing this efficiently.

Norms: Measuring Size with Double Bars

When you see double vertical bars around something — ‖w‖ — that's a norm. It measures the "size" or "length" of a vector or matrix. The subscript tells you which norm.

The L1 norm ‖w‖₁ is the sum of absolute values: |w₁| + |w₂| + ... + |wₙ|. Think of it as the Manhattan distance — how far you'd walk on a city grid. L1 regularization (λ‖w‖₁ added to the loss) pushes weights toward exactly zero, creating sparsity. That's Lasso regression.

The L2 norm ‖w‖₂ is the Euclidean distance: √(w₁² + w₂² + ... + wₙ²). The straight-line distance. L2 regularization (λ‖w‖₂² — note the square) pushes weights toward small values but doesn't force them to zero. That's Ridge regression. When papers refer to "weight decay," they mean L2 regularization.

The Frobenius norm ‖A‖_F is the L2 norm but for matrices — square all elements, sum them, take the square root. It appears when regularizing weight matrices in neural networks.

import numpy as np

w = np.array([3.0, -4.0, 0.0, 2.0])

l1 = np.sum(np.abs(w))           # 9.0 — sum of absolute values
l2 = np.sqrt(np.sum(w ** 2))     # 5.385 — Euclidean length
# np.linalg.norm(w, 1) and np.linalg.norm(w, 2) do the same
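And the Frobenius norm, the same idea one level up (a quick sketch):

import numpy as np

A = np.random.randn(3, 4)
fro = np.linalg.norm(A, 'fro')    # identical to np.sqrt(np.sum(A ** 2))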

Back to the attention formula: √dₖ is the square root of the key dimension. Why divide by it? Because the dot products in QKᵀ grow with the dimension of the vectors. Without the scaling, the softmax would push all its probability mass to the largest value, making gradients vanishingly small. The √dₖ is a normalization — a relative of the norm idea — that keeps the values in a range where softmax behaves well.
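You can check that growth yourself. A small experiment, assuming random unit-variance vectors (roughly what queries and keys look like early in training):

import numpy as np

rng = np.random.default_rng(0)
for d in (4, 64, 1024):
    q = rng.standard_normal((10_000, d))
    k = rng.standard_normal((10_000, d))
    dots = np.sum(q * k, axis=1)               # 10,000 dot products in dimension d
    print(d, round(dots.std(), 1), round(np.sqrt(d), 1))   # std ≈ √d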

Probability Notation

Probability notation is its own dialect within mathematical notation, and ML is soaked in it. The core symbols:

P(A) or p(x) is the probability of event A or the probability density at point x. The convention is uppercase P for discrete probabilities and lowercase p for continuous densities, though papers violate this constantly.

p(y|x) is the conditional probability — the probability of y given that we know x. The vertical bar | reads as "given." When a paper writes p(y|x, θ), it means "the probability of output y given input x and model parameters θ." That's what your model computes at inference time.

X ~ 𝒩(μ, σ²) reads "X is distributed as a Normal (Gaussian) distribution with mean μ and variance σ²." The tilde ~ means "is distributed as." The curly 𝒩 is the calligraphic font for "Normal." When you initialize neural network weights with torch.randn(), you're sampling from 𝒩(0, 1).

𝔼[X] is the expectation — the average value of X if you sampled it infinitely many times. The fancy 𝔼 is "blackboard bold" E. In practice, we approximate it with sample means: 𝔼[X] ≈ (1/n)Σxᵢ. When a loss function is written as 𝔼[L(x, y)], it means "the average loss over the data distribution," and we approximate it by averaging over our training batch.
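Both the ~ and the 𝔼 map to one line of NumPy each. A sketch:

import numpy as np

samples = np.random.normal(loc=5.0, scale=2.0, size=100_000)  # X ~ 𝒩(5, 2²)
np.mean(samples)    # ≈ 5.0: the sample mean approximates 𝔼[X]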

∝ means "proportional to." In Bayesian ML, you'll see p(θ|data) ∝ p(data|θ)p(θ), which says "the posterior is proportional to the likelihood times the prior." The ∝ hides a normalizing constant that's often intractable to compute — which is the entire motivation behind variational inference and MCMC.

Here's the thing that took me too long to realize: p(y|x) isn't scary notation. It's the answer to "what does my model predict for this input?" Every classifier, every regression model, every language model — they all compute some version of p(y|x). Conditional probability isn't an abstract math concept. It's the literal thing your model outputs.

Calculus Notation: Gradients and Derivatives

If you've trained a neural network, you've used calculus — the framework did it for you via autograd. But reading papers requires recognizing the notation.

The partial derivative ∂L/∂w reads "the partial derivative of L with respect to w." It tells you: if you nudge w by a tiny amount, how much does L change? The ∂ symbol (a curly d) signals that L depends on multiple variables, and we're only looking at how one of them affects it while holding the others fixed.

The gradient ∇_θ L is the vector of all partial derivatives: [∂L/∂θ₁, ∂L/∂θ₂, ..., ∂L/∂θₙ]. It points in the direction where L increases fastest. Gradient descent goes the opposite direction: θ ← θ - α∇_θ L. That's the entire algorithm in one line of notation.
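Here's that one line of notation as a running loop, on a toy quadratic where we can write the gradient by hand:

import numpy as np

loss = lambda theta: np.sum((theta - 3.0) ** 2)   # toy loss, minimized at θ = 3
grad = lambda theta: 2 * (theta - 3.0)            # its gradient, ∇_θ L

theta = np.zeros(2)     # start at the origin
alpha = 0.1             # learning rate
for _ in range(100):
    theta = theta - alpha * grad(theta)           # θ ← θ - α∇_θ L

theta    # ≈ [3.0, 3.0]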

The Jacobian J extends the gradient to vector-valued functions. If your function takes a vector of n inputs and produces a vector of m outputs, the Jacobian is the m × n matrix of all partial derivatives. Each row is the gradient of one output with respect to all inputs. Backpropagation through a neural network is, under the hood, a chain of Jacobian-vector products.
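A finite-difference check makes the Jacobian's shape concrete. A sketch for a hypothetical function from ℝ³ to ℝ²:

import numpy as np

def f(x):                               # f: ℝ³ → ℝ²
    return np.array([x[0] * x[1], np.sin(x[2])])

def numeric_jacobian(f, x, eps=1e-6):   # central differences, m × n result
    m, n = f(x).shape[0], x.shape[0]
    J = np.zeros((m, n))
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

J = numeric_jacobian(f, np.array([1.0, 2.0, 0.5]))   # shape (2, 3)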

The Hessian H is the matrix of second derivatives — it tells you about the curvature of the loss landscape. If the gradient tells you which way is downhill, the Hessian tells you whether the hill is steep or gentle, curved or flat. Second-order optimizers like L-BFGS use Hessian information (or approximations of it) to take smarter steps than vanilla gradient descent.

I'll be honest — the distinction between numerator layout and denominator layout conventions for Jacobians still trips me up. Different textbooks and papers use different conventions for how to arrange the rows and columns. If a shape doesn't make sense in a derivation, check whether the author is using the transposed convention. It's a real source of bugs in manual implementations.

Einstein Summation: The Modern Shorthand

Einstein summation notation is a convention from physics that has become essential in modern deep learning. The rule is elegant: any index that appears twice in a single term is implicitly summed over. No Σ needed.

Consider matrix multiplication. In standard notation: Cᵢₖ = Σⱼ AᵢⱼBⱼₖ. In Einstein notation: Cᵢₖ = AᵢⱼBⱼₖ. The j appears in both A and B, so it's summed over. That's it. The repeated index vanishes in the result.

This becomes powerful with NumPy and PyTorch's einsum function, which directly implements Einstein notation:

import numpy as np

A = np.random.randn(3, 4)  # 3×4 matrix
B = np.random.randn(4, 5)  # 4×5 matrix

# Matrix multiplication: j is summed over
C = np.einsum('ij,jk->ik', A, B)  # 3×5 result

# Dot product: i is summed over
a = np.random.randn(4)
b = np.random.randn(4)
dot = np.einsum('i,i->', a, b)  # scalar

# Batched matrix multiply (attention scores!)
Q = np.random.randn(8, 10, 64)  # batch=8, tokens=10, d=64
K = np.random.randn(8, 10, 64)
scores = np.einsum('bqd,bkd->bqk', Q, K)  # 8×10×10

That last example is the heart of self-attention. Q and K share the batch dimension b (preserved in output) and the feature dimension d (summed over — it appears in both inputs but not the output). The result is a score matrix for each batch, with one score for every query-key pair. That's QKᵀ from our attention formula, expressed in one string.

I'm still developing my intuition for reading complex einsum patterns on first pass. For anything beyond two operands, I find myself writing out the shapes on paper. That's fine. The point isn't to be fast — it's to be precise.

Blackboard Bold and Calligraphic Fonts

ML papers use fancy fonts to distinguish types of mathematical objects. This is pure convention — the fonts carry no mathematical content — but recognizing them speeds up reading.

Blackboard bold (the hollow letters) is used for number sets and operators:

ℝ is the real numbers. ℝⁿ is n-dimensional real space. When a paper says x ∈ ℝ⁷⁶⁸, it means x is a vector with 768 real-valued components — like a BERT embedding. ℤ is the integers. ℕ is the natural numbers. 𝔼 is expectation. ℙ is probability.

Calligraphic (the curly letters) is used for sets, spaces, distributions, and loss functions:

𝒟 is typically a dataset or a data distribution. 𝒩 is the Normal distribution. ℒ is a loss function (some papers use 𝒥 for the objective instead). 𝒳 and 𝒴 are the input and output spaces. ℋ is a hypothesis class — the set of all functions your model architecture can represent.

When you see 𝒟 = {(x⁽ⁱ⁾, y⁽ⁱ⁾)}ᵢ₌₁ⁿ, that's "a dataset 𝒟 consisting of n input-output pairs." Notice how many conventions collide in that one expression: calligraphic 𝒟, set-builder curly braces, parenthesized superscripts for sample indices, and a subscript-to-superscript range on the index. If you can read that line, you can read most ML papers.
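In code, that whole expression is just a list of pairs. A sketch:

import numpy as np

n = 4
X = np.random.randn(n, 3)               # n inputs, 3 features each
y = np.random.randint(0, 2, size=n)     # n labels

D = [(X[i], y[i]) for i in range(n)]    # 𝒟 = {(x⁽ⁱ⁾, y⁽ⁱ⁾)}ᵢ₌₁ⁿ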

Logarithms and Exponentials: The Numerical Trick

The natural logarithm ln(x) is the inverse of the exponential eˣ. Its killer feature for ML: ln(a · b) = ln(a) + ln(b). Products become sums.

Remember that likelihood P(data) = Πᵢ P(xᵢ) that underflows to zero? Take the log: ln P(data) = Σᵢ ln P(xᵢ). Now it's a sum of manageable negative numbers. This is the log-likelihood, and maximizing it is identical to maximizing the likelihood because log is monotonically increasing. We then negate it (because optimizers minimize, not maximize) and we get the negative log-likelihood — which, for classification, is cross-entropy loss:

L = -(1/n) Σᵢ [yᵢ ln(pᵢ) + (1-yᵢ) ln(1-pᵢ)]

Three names for the same thing: minimizing negative log-likelihood, minimizing cross-entropy, maximizing likelihood. This is the pattern behind almost every loss function: write the probabilistic model, take its likelihood, apply log, negate, average.

import numpy as np

y = np.array([1, 0, 1, 1])               # true labels
p = np.clip([0.9, 0.1, 0.8, 0.7], 1e-15, 1-1e-15)  # predictions (clipped!)

bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
loss = np.mean(bce)  # ~0.1976

The clip is non-negotiable. ln(0) is negative infinity, and it will produce NaN in your gradients. If your training loss ever shows NaN or inf, check for zeros being fed to a log before you look anywhere else.

The softmax function also lives here: softmax(zᵢ) = eᶻⁱ / Σⱼ eᶻʲ. It takes a vector of raw scores (logits) and converts them to probabilities that sum to 1. The exponential ensures everything is positive, and the denominator normalizes. In our attention formula, softmax turns the raw attention scores QKᵀ/√dₖ into a probability distribution over keys for each query.
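In code, softmax needs one extra trick the formula doesn't show: subtracting the max logit before exponentiating, so the eᶻ terms can't overflow. A minimal sketch:

import numpy as np

def softmax(z):
    z = z - np.max(z)           # shift for stability; cancels out in the ratio
    e = np.exp(z)
    return e / np.sum(e)

softmax(np.array([2.0, 1.0, 0.1]))   # probabilities summing to 1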

Putting It All Together: Reading the Attention Formula

We've arrived at the destination. Let's read the formula we started with, symbol by symbol:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Q, K, V — uppercase letters, so they're matrices. Q is the query matrix (shape: tokens × d_model). K is the key matrix. V is the value matrix. Each row is one token's representation.

Kᵀ — the superscript T means transpose. We flip K so its shape goes from (tokens × dₖ) to (dₖ × tokens). This makes the matrix multiplication work: (tokens × dₖ) · (dₖ × tokens) = (tokens × tokens).

QKᵀ — matrix multiplication of Q and the transposed K. The result is a (tokens × tokens) matrix where each entry is the dot-product similarity between a query token and a key token. This is the attention scores — which queries attend to which keys.

dₖ — lowercase d with subscript k. A scalar. The dimensionality of the key vectors. The subscript is a label, not an index.

√dₖ — square root of that scalar. We divide by it to prevent the dot products from getting too large and causing the softmax to saturate.

softmax(…) — the function that converts the scaled scores to probabilities. Each row of the resulting matrix sums to 1. Now each query has a probability distribution over all keys.

softmax(…) V — matrix multiply the attention weights with V. This is a weighted sum of value vectors. Each query's output is a blend of all value vectors, weighted by how much attention that query pays to each key.

That's it. One formula, twelve notation concepts. And now you can read it like code.
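And as proof, here's the whole formula as runnable code. A minimal sketch: single head, no batching, no masking:

import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # QKᵀ / √dₖ, shape (tokens × tokens)
    scores = scores - scores.max(axis=-1, keepdims=True)      # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # rows sum to 1
    return weights @ V                     # weighted sum of value vectors

tokens, d_k = 5, 16
Q = np.random.randn(tokens, d_k)
K = np.random.randn(tokens, d_k)
V = np.random.randn(tokens, d_k)
out = attention(Q, K, V)                   # shape (5, 16)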

The Pattern You'll See Everywhere

Almost every ML loss function follows the same template: write a probabilistic model, take its likelihood (a product using Π), apply logarithm (turning it into a sum using Σ), negate it (because we minimize), and average over the dataset. Cross-entropy, KL divergence, variational lower bounds — they all emerge from this recipe. Once you see the pattern, new loss functions stop being intimidating.

If you're still with me, thank you. I hope it was worth it.

We started with a confession about closing browser tabs at the sight of Greek letters, and we've ended by reading one of the most consequential formulas in modern AI symbol by symbol. Along the way, we learned the visual conventions for scalars, vectors, and matrices. We met ten Greek letters that cover most of ML. We untangled the triple overloading of subscripts and superscripts. We decoded decorators, summation, products, norms, probability notation, calculus operators, Einstein summation, and the fancy fonts that papers use to distinguish mathematical objects.

My hope is that the next time you open an ML paper and see a wall of notation, instead of closing the tab, you'll slow down, identify each symbol, and translate it piece by piece — the way we translated the attention formula here. You have the decompression algorithm now. The rest is practice.
