Numerical Computing
Computers lie about numbers. They store approximations, not exact values, and naive math on those approximations produces silent garbage — NaN losses, zero gradients, training runs that drift into nonsense. This section builds your instinct for where the traps are: IEEE 754 floating point, catastrophic cancellation, the log-sum-exp trick, why you never invert a matrix, how GPUs actually move data, and the mixed-precision regime that is now the default for all serious training. These aren't academic footnotes. They're the difference between a model that converges and one that silently fails at 3 AM.
I avoided digging into numerical computing for longer than I'd like to admit. Every time a training run returned NaN, I'd restart it with a smaller learning rate and move on. Every time someone mentioned "catastrophic cancellation," I'd nod wisely and change the subject. Finally, after spending two full days debugging a loss that turned out to be an unprotected log(0), the discomfort of not understanding what was actually happening with my numbers grew too great. Here is that dive.
Numerical computing is the study of how to do math on machines that can't actually do math perfectly. The field has been around since the 1950s, when scientists first realized that computers introduce errors every time they store a number. In ML, these errors compound across millions of operations per training step, and the consequences range from subtle accuracy drift to spectacular NaN explosions.
Before we start, a heads-up. We're going to be walking through binary representations, exponents, and some matrix algebra, but you don't need to know any of it beforehand. We'll add what we need, one piece at a time.
This isn't a short journey, but I hope you'll be glad you came.
The number line has holes
When small errors become catastrophes
The log-sum-exp trick
Rest stop
Solving systems without inverting matrices
When matrices get too big
The GPU: a factory floor, not an office
Mixed precision — training with half the bits
Wrap-up
Resources
Computers Lie About Numbers
Let's start with the most famous lie in computing.
a = 0.1 + 0.2
print(a == 0.3) # False
print(f"{a:.20f}") # 0.30000000000000004441
The first time I saw this, I genuinely questioned whether Python was broken. It's not. The issue is deeper than any programming language. Computers think in binary — base 2 — and in base 2, the number 0.1 has no exact representation. It's the same reason 1/3 has no exact decimal representation: 0.33333... goes on forever. In binary, 0.1 goes on forever too. So the computer stores the closest binary fraction it can fit, and that introduces a tiny error. This is not a bug. It's a law of the universe, like friction.
The IEEE 754 standard, established in 1985 and used by virtually every computer on Earth, defines how these approximations work. Every floating-point number is stored as three pieces: a sign bit (positive or negative), an exponent (which sets the magnitude), and a mantissa (also called the significand, which carries the actual precision digits). Think of it like scientific notation. The number -2.998 × 10⁸ has a sign (negative), a significand (2.998), and an exponent (8). IEEE 754 does the same thing, but in base 2: the value is (-1)^sign × 1.mantissa × 2^(exponent - bias).
Here's a way to think about it that stuck with me. Imagine a ruler where the tick marks are not evenly spaced. Near zero, the ticks are packed tightly together — you can distinguish between very small numbers with fine precision. But as you move toward larger magnitudes, the ticks spread farther and farther apart. Between 1.0 and 2.0, there are millions of representable float32 values. Near 100,000,000, consecutive float32 values are 8 apart; the number 100,000,001 simply doesn't exist. The ruler covers an enormous range, but it pays for that range by getting coarser as the numbers grow.
The formats you'll encounter in practice are float64 (double precision: 1 sign + 11 exponent + 52 mantissa bits, about 15-16 decimal digits of precision), float32 (single precision: 1 sign + 8 exponent + 23 mantissa bits, about 7 decimal digits), float16 (half precision: 1 sign + 5 exponent + 10 mantissa bits, about 3.3 decimal digits), and bfloat16 (brain float: 1 sign + 8 exponent + 7 mantissa bits, about 2.4 decimal digits but the same range as float32). NumPy defaults to float64. PyTorch defaults to float32. That gap matters enormously and we'll come back to it.
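You can peek at those three pieces directly. Here's a small sketch that unpacks a float32 into its sign, exponent, and mantissa using only the standard struct module (the helper name is mine, not a library function):
import struct
def float32_bits(x):
    # Reinterpret the 4 bytes of a float32 as an unsigned integer
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF     # 8 exponent bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF         # 23 explicit mantissa bits
    return sign, exponent - 127, mantissa
print(float32_bits(1.0))    # (0, 0, 0): +1.0 x 2^0
print(float32_bits(-2.0))   # (1, 1, 0): -1.0 x 2^1
print(float32_bits(0.1))    # (0, -4, 5033165): the closest binary fraction to 0.1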
The Number Line Has Holes
There's a number called machine epsilon, often written ε, and it tells you the fundamental resolution of your floating-point format. It's the gap between 1.0 and the next representable number above it: the smallest relative step the format can resolve. For float32, machine epsilon is about 1.19 × 10⁻⁷. For float64, it's about 2.22 × 10⁻¹⁶. Anything much smaller than that, relative to the number you're adding it to, gets swallowed. Vanishes. As if it never existed.
import numpy as np
print(np.finfo(np.float32).eps) # 1.1920929e-07
print(np.finfo(np.float64).eps) # 2.220446049250313e-16
# When adding small to large, the small disappears
big = np.float32(1e8)
small = np.float32(1.0)
print(big + small == big) # True — the 1.0 is gone entirely
Stop and feel the weight of that last line. Adding 1.0 to a hundred million gives you... a hundred million. The 1.0 got absorbed. In float32, a hundred million and a hundred million plus one are the same number. The tick marks on our ruler are too far apart to tell them apart.
Now think about what gradient accumulation does. It adds many small gradient updates to a large running sum, thousands of times. If your accumulated loss is in the thousands and a gradient update is 0.00001, that update can be smaller than machine epsilon relative to the sum. It disappears. Your model looks like it's training, the loss prints to your terminal every epoch, but the weights haven't budged. I've seen this happen in production. It's the quietest bug you'll ever encounter.
There are also special values. When the exponent bits are all ones and the mantissa is zero, you get infinity — the result of overflow, like computing e^1000. When the exponent bits are all ones but the mantissa is non-zero, you get NaN (Not a Number), which is what you get from 0/0 or infinity minus infinity. NaN is contagious: any operation involving NaN produces NaN. And here's the strangest thing — NaN is the only floating-point value that is not equal to itself. That's actually how you detect it: x != x is true only when x is NaN.
At the other extreme, when all exponent bits are zero, you get denormalized numbers (also called subnormals). These are the tiny numbers between zero and the smallest normal float, filling what would otherwise be an abrupt gap. They matter because gradients near zero sometimes live in this range. The catch: some hardware processes denormals 10-100x slower than normal floats, which can cause mysterious performance drops in training.
Never compare floating-point numbers with ==. Use np.isclose(a, b) or np.allclose(a, b) with appropriate tolerances. This applies to loss values, metrics, convergence checks — everywhere. Also: np.isnan(x) is your friend. Sprinkle it liberally when debugging training.
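For example (np.isclose, np.allclose, and np.isnan are all standard NumPy):
import numpy as np
a = 0.1 + 0.2
print(a == 0.3)                            # False
print(np.isclose(a, 0.3))                  # True, within default tolerances
print(np.allclose([a, 1.0], [0.3, 1.0]))   # True, element-wise with tolerance
x = float("nan")
print(x == x)                              # False: NaN is never equal to itself
print(np.isnan(x))                         # True: the reliable way to detect it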
When Small Errors Become Catastrophes
Floating-point errors are usually tiny — 10⁻⁷ relative error in float32, 10⁻¹⁶ in float64. The real danger is when an algorithm takes those tiny errors and amplifies them into disasters. That's what the term numerical instability means, and ML is riddled with opportunities for it.
The most common villain is catastrophic cancellation. It happens when you subtract two numbers that are nearly equal. Suppose a = 1.00000005 and b = 1.00000004 in float32. The true difference is 10⁻⁸, but float32 can only represent about 7 decimal digits. Both numbers stored the same way. The difference becomes exactly zero. All significant digits, gone.
Where does this ambush us in practice? Variance. The textbook one-pass formula says Var(X) = E[X²] - E[X]². Both terms can be enormous when the mean is large, but their difference — the actual variance — might be tiny. In float32, this formula will return zero for data with a small spread around a large mean. Worse, it can return a negative number, which is nonsensical for a variance. Batch normalization computes variance of activations on every forward pass. If the activations have a large mean and small variance, the textbook formula silently poisons your normalization layer.
The fix is Welford's algorithm, an online method that accumulates variance without ever subtracting two large numbers. It maintains a running mean and a running sum of squared deviations from that mean, updating both as each data point arrives. It's a single pass through the data, it's numerically stable, and it's what every serious library uses under the hood.
import numpy as np
# The textbook formula self-destructs for large means with small variance
data = np.float32([10000000.0, 10000001.0, 10000002.0])
naive_var = np.mean(data**2) - np.mean(data)**2
print(f"Naive variance: {naive_var}") # 0.0 — completely wrong
print(f"Correct variance: {np.var(data)}") # 0.6667 — np.var uses stable algorithm
# Welford's algorithm — stable single-pass variance
def welford_var(data):
    n = 0
    mean = 0.0
    M2 = 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        delta2 = x - mean
        M2 += delta * delta2
    return M2 / n
print(f"Welford variance: {welford_var(data.astype(np.float64)):.4f}") # 0.6667
Catastrophic cancellation also lurks in log(1 + x) when x is tiny. If x = 10⁻¹⁷, smaller than float64's machine epsilon, then 1 + x rounds to 1.0, and log(1.0) is exactly 0. The true answer, approximately 10⁻¹⁷, has been completely destroyed. Python and NumPy provide np.log1p(x) which computes log(1 + x) directly without first adding 1, keeping full precision. Similarly, np.expm1(x) computes e^x - 1 for small x without cancellation. These aren't obscure library curiosities — they show up constantly in log-likelihood computations and loss functions.
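To see the difference (log1p and expm1 are standard NumPy functions):
import numpy as np
x = 1e-17                  # below float64 machine epsilon
print(np.log(1 + x))       # 0.0, because 1 + x already rounded to 1.0
print(np.log1p(x))         # 1e-17, full precision preserved
print(np.exp(x) - 1)       # 0.0, the same cancellation in the other direction
print(np.expm1(x))         # 1e-17, correct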
Another subtle trap: long naive sums. When you add millions of small numbers (say, per-sample losses across a dataset), the running sum grows large while each new term is small. The small terms start getting swallowed by machine epsilon. The fix is Kahan summation: the algorithm maintains a separate "compensation" variable that tracks the accumulated rounding error, feeding it back into the next addition. It turns O(n·ε) drift into O(ε) drift. Some frameworks use it internally for gradient accumulation, but it's worth knowing about because naive Python sum() does not.
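Here's a minimal sketch of the idea (the helper name kahan_sum is mine, not a library function), using float32 so the drift is visible:
import numpy as np
def kahan_sum(values):
    total = np.float32(0.0)
    comp = np.float32(0.0)        # running compensation for lost low-order bits
    for x in values:
        y = x - comp              # apply the correction carried from the last step
        t = total + y             # low-order bits of y are lost in this addition...
        comp = (t - total) - y    # ...and recovered here
        total = t
    return total
values = np.full(1_000_000, 0.1, dtype=np.float32)
naive = np.float32(0.0)
for x in values:
    naive = naive + x
print(naive)               # drifts visibly away from the true 100000
print(kahan_sum(values))   # very close to 100000
print(np.sum(values))      # np.sum uses pairwise summation, also accurate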
The Log-Sum-Exp Trick
If you learn one numerical trick for ML, make it this one. It shows up in softmax, in log-likelihood computations, in attention scores, in variational inference, everywhere. I'm still a little amazed at how much mileage this single idea gets.
The problem starts innocently. You have a vector of numbers — let's say logits from a neural network — and you need to compute log(e^a₁ + e^a₂ + ... + e^aₙ). Maybe you're computing a softmax, or a log-partition function, or a log-marginal likelihood. In exact math, this is straightforward. On a computer, it's a minefield.
Let's walk through a concrete example. Suppose our logits are [1000, 1001, 1002]. We need e^1000. How big is that? It's roughly 2 × 10⁴³⁴, a number with 435 digits. Float64 maxes out around 10³⁰⁸. So e^1000 overflows to infinity, and our whole computation returns inf. Now suppose our logits are [-1000, -1001, -1002]. Every e^aᵢ underflows to 0.0, and we're taking log(0), which gives us negative infinity. Both are garbage.
The trick is deceptively simple. Factor out the maximum value:
log(Σ e^aᵢ) = max(a) + log(Σ e^(aᵢ - max(a)))
Why does this work? Because Σ e^aᵢ = e^max(a) · Σ e^(aᵢ - max(a)), and taking the log separates it into max(a) + log(Σ e^(aᵢ - max(a))). Now the largest exponent in the sum is e^(max(a) - max(a)) = e⁰ = 1. Nothing overflows. And at least one term in the sum is 1, so the argument to log is always at least 1, so we never take log of zero. The answer is mathematically identical to the original expression. The numerics are completely different.
import numpy as np
def naive_logsumexp(x):
    return np.log(np.sum(np.exp(x)))

def stable_logsumexp(x):
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))
big = np.array([1000.0, 1001.0, 1002.0])
print(naive_logsumexp(big)) # inf — overflow
print(stable_logsumexp(big)) # 1002.408 — correct
small = np.array([-1000.0, -1001.0, -1002.0])
print(naive_logsumexp(small)) # -inf — underflow
print(stable_logsumexp(small)) # -999.592 — correct
# scipy has it built in, and so does PyTorch
from scipy.special import logsumexp
print(logsumexp(big)) # 1002.408
The softmax function uses the same insight. Naively computing e^xᵢ / Σe^xⱼ overflows when any xᵢ is large. The fix: subtract max(x) from every element before exponentiating. Because the constant appears in both the numerator and denominator, it cancels perfectly — the output is mathematically identical, but every exponent is now ≤ 0, so nothing overflows.
def stable_softmax(x):
    shifted = x - np.max(x)
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x)
logits = np.array([1000.0, 1001.0, 1002.0])
print(stable_softmax(logits)) # [0.0900, 0.2447, 0.6652] — correct
This is why we work with log-probabilities throughout ML rather than raw probabilities. A probability of 10⁻³⁰⁰ underflows to zero in any floating-point format. But its log, -690.7, is a perfectly ordinary number that fits comfortably in a float32. Multiplying probabilities becomes adding log-probabilities. Division becomes subtraction. Everything stays in a numerically comfortable range.
And this is exactly why PyTorch's CrossEntropyLoss expects raw logits, not probabilities. Internally, it computes log_softmax — which combines the log and the softmax into a single numerically stable operation — and then applies the negative log-likelihood. If you apply softmax yourself first and then pass the result to NLLLoss, you've broken the stability guarantee. I've seen this mistake in production code. The model trains, sort of, but the loss landscape becomes noisy because of floating-point gremlins in the separate softmax-then-log pipeline.
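Here's a small PyTorch sketch of that failure mode; the extreme logits are contrived to make the underflow visible:
import torch
import torch.nn.functional as F
logits = torch.tensor([[0.0, 0.0, 1000.0]])
target = torch.tensor([0])    # the unlikely class, where precision matters most
# Stable: cross_entropy applies log_softmax internally
print(F.cross_entropy(logits, target))        # tensor(1000.)
# Broken: softmax underflows the small probabilities to exactly 0, then log(0) = -inf
probs = F.softmax(logits, dim=1)
print(F.nll_loss(torch.log(probs), target))   # tensor(inf)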
Every time you see np.exp() in your code, pause. Can the input be large positive? Can it be very negative? If so, you probably need the subtract-the-max pattern. And every time you see np.log(), ask: can the argument be zero or negative? If so, you need log1p or the log-sum-exp formulation. Building this reflex will save you from 90% of numerical bugs.
Rest Stop
Congratulations on making it this far. If you want to stop here, you can.
You now have a mental model for how computers represent numbers (a ruler with uneven tick marks that get coarser at larger magnitudes), why subtracting nearly equal numbers is dangerous (catastrophic cancellation), and the single most important stabilization trick in all of ML (log-sum-exp / subtract-the-max). With those three ideas, you can diagnose and fix the vast majority of NaN explosions and silent numerical drift in training pipelines.
That's the short version. You're 60% of the way there.
What's ahead: how to solve linear systems without inverting matrices (and why the condition number tells you whether to even trust the answer), how GPUs actually move data around (spoiler: memory bandwidth is the bottleneck, not compute), and the mixed-precision training regime that has become the default for all serious work. These aren't academic topics — they're the skills that separate someone who can train a model from someone who can train a model efficiently and debug it when the numbers go wrong.
But if you feel solid on the foundations and the discomfort of not knowing what happens in those deeper layers is nagging at you, read on.
Solving Systems Without Inverting Matrices
I still catch myself reaching for matrix inversion. The notation x = A⁻¹b is so clean on paper that it feels like the natural thing to compute. It's not. Computing a matrix inverse is roughly three times more expensive than solving the system directly, and the numerical errors are significantly worse. This is one of those cases where the mathematically elegant approach and the computationally smart approach are different things.
Given Ax = b, use np.linalg.solve(A, b). Under the hood, this performs LU decomposition — it factors A into a product of a lower-triangular matrix L and an upper-triangular matrix U, then solves two easy triangular systems by forward and back substitution. Same answer, less work, fewer rounding errors.
If your matrix A happens to be symmetric and positive definite — and in ML this is common, think covariance matrices, kernel matrices, the Gram matrix XᵀX — you can do better still. Cholesky decomposition factors A = LLᵀ where L is lower-triangular. It's about twice as fast as general LU, and it's more numerically stable because it exploits the matrix's structure. Use scipy.linalg.cho_factor() followed by scipy.linalg.cho_solve().
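A minimal sketch with SciPy; the matrix here is just a small random positive definite stand-in:
import numpy as np
from scipy.linalg import cho_factor, cho_solve
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
A = X.T @ X + 1e-3 * np.eye(5)   # Gram matrix plus a small ridge: symmetric positive definite
b = rng.standard_normal(5)
c, low = cho_factor(A)           # factor A = L Lᵀ once
x = cho_solve((c, low), b)       # then two cheap triangular solves
print(np.allclose(A @ x, b))     # True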
But how do you know if your system is even trustworthy? The condition number of a matrix, written κ(A), measures how much errors in the input get amplified in the output. If κ(A) = 10ᵏ, you lose roughly k digits of precision in the solution. A condition number of 10¹⁶ means your float64 answer has essentially zero reliable digits — you've been rearranging noise. I'll be honest: condition numbers still intimidate me a bit, but the practical rule is simple — check it with np.linalg.cond(A), and if the number is bigger than about 10¹⁰, treat your solution with deep suspicion.
import numpy as np
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
# Don't do this:
x_bad = np.linalg.inv(A) @ b
# Do this:
x_good = np.linalg.solve(A, b)
print(f"Condition number: {np.linalg.cond(A):.2f}") # 2.62 — healthy
# Ill-conditioned example
X = np.random.randn(100, 50)
XtX = X.T @ X
lam = 1.0
print(f"Unregularized cond: {np.linalg.cond(XtX):.0f}")
print(f"Regularized cond: {np.linalg.cond(XtX + lam * np.eye(50)):.0f}")
That last example connects to something deep. In ridge regression, the normal equations give (XᵀX)β = Xᵀy. If your features are correlated, XᵀX becomes ill-conditioned — its smallest eigenvalues are near zero, the condition number explodes, and the solution oscillates wildly with tiny changes in the data. Adding a regularization term λI to get (XᵀX + λI)β = Xᵀy inflates those tiny eigenvalues, pulling the condition number back to something sane. Regularization isn't only about preventing overfitting. It's also about making the linear algebra actually solvable on a machine with finite precision. Two problems, one fix.
When Matrices Get Too Big
LU and Cholesky are direct solvers — they give you an answer (up to floating-point rounding) in a fixed number of steps. But they scale as O(n³). When n is a million — a large graph Laplacian, a kernel matrix for a big dataset — you can't even store the full matrix in memory, let alone factor it.
Iterative methods approach the solution gradually, producing a better approximation with each step. The workhorse for symmetric positive definite systems is the conjugate gradient method. It only needs to compute matrix-vector products Av, never the full matrix itself. That means it works beautifully with sparse matrices — matrices that are mostly zeros, stored in compact formats that only record the non-zero entries.
And sparse matrices are everywhere in ML. A document-term matrix with 100,000 documents and 50,000 words has 5 billion entries, but any single document uses maybe 500 unique words. That's 99% zeros. Storing the full dense matrix is absurd. The two standard sparse formats are CSR (Compressed Sparse Row) and CSC (Compressed Sparse Column). CSR stores three arrays — the non-zero values, their column indices, and pointers to where each row starts — making row slicing and matrix-vector products fast. CSC is the column-oriented analog. In Python, scipy.sparse handles both. The rule of thumb: above about 90% zeros, sparse wins on both memory and speed.
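A sketch of how the pieces fit together, using a tridiagonal matrix as a stand-in for the huge sparse symmetric positive definite systems you'd meet in practice:
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg
n = 100_000
main = 2.01 * np.ones(n)
off = -np.ones(n - 1)
A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csr")  # sparse SPD matrix
b = np.ones(n)
x, info = cg(A, b)                 # conjugate gradient: only needs A @ v products
print(info)                        # 0 means it converged
print(np.linalg.norm(A @ x - b))   # small residual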
For finding eigenvalues of large sparse matrices — the top principal components, the smallest eigenvectors of a graph Laplacian for spectral clustering — the tools are different again. Power iteration is the simplest: multiply by A and normalize, repeat. It converges to the eigenvector of the largest-magnitude eigenvalue. Google's original PageRank was essentially power iteration on the web's link matrix. For more eigenvectors simultaneously, the Lanczos algorithm (SciPy's scipy.sparse.linalg.eigsh()) is the standard. And for truly massive matrices where even Lanczos is slow, randomized SVD — project into a random low-dimensional subspace, then compute exact SVD there — is surprisingly effective. Scikit-learn's PCA uses it by default for large inputs.
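Power iteration is short enough to write from memory. A sketch on a tiny dense matrix (the same loop works unchanged on a scipy.sparse matrix):
import numpy as np
def power_iteration(A, n_steps=1000):
    v = np.random.default_rng(0).standard_normal(A.shape[1])
    for _ in range(n_steps):
        w = A @ v                     # only a matrix-vector product is needed
        v = w / np.linalg.norm(w)
    return v @ (A @ v), v             # Rayleigh quotient estimates the eigenvalue
A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, v = power_iteration(A)
print(lam)                            # ~3.618, the dominant eigenvalue
print(np.linalg.eigvalsh(A).max())    # same value from the dense solver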
The GPU: A Factory Floor, Not an Office
A CPU is like a small team of brilliant engineers. Maybe 16 of them, each capable of complex reasoning, branching logic, and creative problem-solving. A GPU is like a factory floor with thousands of workers, each doing one simple repetitive task in lockstep. The CPU excels when the work is varied and sequential. The GPU excels when the work is uniform and parallel. Matrix multiplication — the heartbeat of neural networks — is exactly the kind of uniform, embarrassingly parallel work that a factory floor devours.
The GPU memory hierarchy is where most performance stories are actually told. Global memory — the VRAM you see on spec sheets, 24GB on an RTX 4090, 80GB on an A100 — is large but slow. Shared memory is a small, fast scratchpad (tens of KB) shared within a thread block, sitting on the chip itself. Registers are the fastest, private to each thread. The counterintuitive reality is that memory bandwidth, not compute, is usually the bottleneck. Moving data from global memory to the compute cores takes far longer than the arithmetic itself. A modern GPU can perform trillions of operations per second, but if every operation needs to fetch from global memory, it's sitting idle waiting for data most of the time.
This is why algorithms like FlashAttention matter. Standard attention computes and stores the full N×N attention matrix in global memory. FlashAttention tiles the computation so that each block fits in shared memory, processes it entirely on-chip, and writes only the final output back to global memory. The FLOPs are actually the same or slightly higher. But the memory traffic drops dramatically, so it's much faster. The lesson generalizes: in GPU computing, the question is rarely "how many operations?" and almost always "how many bytes moved?"
Modern GPUs also include Tensor Cores, specialized hardware units that perform small matrix multiplies (4×4 or 8×8 blocks) in reduced precision at extraordinary speed. A Tensor Core can multiply in float16 while accumulating the result in float32, getting roughly 2-8x the throughput of standard CUDA cores. This hardware is what makes the next section practical.
Mixed Precision — Training with Half the Bits
Training in full float32 is safe but wasteful. That's the insight behind mixed-precision training: most neural network operations — forward pass matrix multiplies, convolutions, attention — don't need 32 bits of precision. They work fine in 16-bit. So run the forward and backward passes in half precision (float16 or bfloat16), which uses half the memory and runs faster on Tensor Cores. Keep a master copy of the weights in float32, and do the weight update step in float32. That's the whole scheme.
The danger with float16 is its limited range. The smallest positive normal float16 is about 6×10⁻⁵. Gradient values smaller than this underflow to zero, and your model silently stops learning. Loss scaling handles this: before the backward pass, multiply the loss by a large constant (say 1024). This scales all gradients up into float16's representable range. After the backward pass, divide by the same constant before updating weights. PyTorch's GradScaler does this automatically, dynamically increasing the scale when training is stable and decreasing it when it detects overflow.
import torch
from torch.cuda.amp import autocast, GradScaler
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()
for inputs, targets in dataloader:
    optimizer.zero_grad()
    with autocast():  # forward in fp16
        outputs = model(inputs.cuda())
        loss = loss_fn(outputs, targets.cuda())
    scaler.scale(loss).backward()  # backward in fp16 with scaled loss
    scaler.step(optimizer)  # unscale gradients, update in fp32
    scaler.update()  # adjust scale factor
The choice between float16 and bfloat16 is a tradeoff between precision and range. Float16 gives you 5 exponent bits and 10 mantissa bits — decent precision, but a narrow range that requires loss scaling. Bfloat16 gives you 8 exponent bits (the same as float32!) but only 7 mantissa bits — coarser precision, but the same range as float32. In practice, bfloat16 almost never overflows or underflows, which means you can often skip loss scaling entirely. This is why Google designed it for their TPUs, and why NVIDIA's A100 and H100 both support it natively. If your hardware supports bfloat16, use it. It's the less fussy option.
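With bfloat16 the loop gets simpler. A sketch, assuming an Ampere-or-newer GPU and the same placeholder model, loss_fn, optimizer, and dataloader as the snippet above:
import torch
for inputs, targets in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        outputs = model(inputs.cuda())
        loss = loss_fn(outputs, targets.cuda())
    loss.backward()       # no GradScaler: bfloat16 has float32's range
    optimizer.step()      # weights stay in float32; autocast only casts the ops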
There's also FP8, which arrived with NVIDIA's Hopper architecture (H100). It comes in two flavors: E4M3 (4 exponent, 3 mantissa bits) for forward passes and E5M2 (5 exponent, 2 mantissa bits) for backward passes. The H100's Tensor Cores operate natively on FP8, at roughly twice the float16 throughput on the same chip. My intuition for when FP8 is safe is still developing, but the trajectory is clear — each hardware generation pushes viable precision lower.
The practical impact of mixed precision is enormous: it roughly halves memory usage (which means you can double your batch size or fit a larger model), and runs 1.5-3x faster on Tensor Core-enabled hardware. In 2024, there is no good reason to train in full float32 unless you're actively debugging a numerical issue. Mixed precision isn't an optimization. It's the default.
Never reduce precision for the weight update itself. Accumulating tiny gradient updates into float16 weights destroys training through rounding errors — the updates are smaller than float16's epsilon relative to the weights, so they vanish. The pattern is sacred: forward in fp16, backward in fp16, accumulate and update in fp32. Always keep a float32 master copy of your weights.
Wrap-Up
If you're still with me, thank you. I hope it was worth it.
We started with the unsettling fact that 0.1 + 0.2 ≠ 0.3, traced through how IEEE 754 stores numbers on an unevenly-spaced ruler, watched catastrophic cancellation destroy a variance computation, learned the log-sum-exp trick that makes softmax and log-likelihoods safe, detoured through condition numbers and why inverting matrices is almost always wrong, peeked at the GPU memory hierarchy that actually determines training speed, and landed on the mixed-precision regime that has become standard practice for all serious training.
My hope is that the next time you see NaN in a training log, instead of blindly reducing the learning rate and restarting, you'll have a pretty darn good mental model of where to look — an unprotected exp(), a variance computed with the textbook formula, a condition number that's blown up, or a gradient that underflowed to zero in float16. The numbers don't lie, but they do approximate. Now you know how to work with that.
Resources
- What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg — the O.G. deep dive. Dense but unforgettable once you push through it.
- Numerical Linear Algebra by Trefethen and Bau — the gold standard textbook. The chapters on conditioning and stability are wildly clear.
- NVIDIA's Mixed Precision Training Guide — the definitive practical reference for mixed precision with PyTorch.
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness by Tri Dao et al. — a beautiful example of how understanding memory hierarchy changes everything.
- The Floating-Point Guide — a gentle, visual introduction to floating-point arithmetic. Great place to start if IEEE 754 still feels foggy.
✅ What You Should Now Be Able To Do
- Explain why 0.1 + 0.2 != 0.3 in terms of IEEE 754 binary representation and the unevenly-spaced number line
- Identify catastrophic cancellation in code (variance formulas, loss functions) and reach for stable alternatives like Welford's algorithm and log1p
- Implement the log-sum-exp trick from memory, and explain why PyTorch's CrossEntropyLoss wants raw logits, not probabilities
- Choose between np.linalg.solve, Cholesky, and iterative solvers based on matrix properties, and explain why you never invert
- Read a condition number and know when a solution is trustworthy vs. garbage
- Explain why GPU memory bandwidth, not compute, is usually the bottleneck, and why FlashAttention helps
- Set up mixed-precision training with autocast and GradScaler, and explain why the weight update must stay in fp32
- Compare float16, bfloat16, and FP8 — their tradeoffs, when to use each, and why bfloat16 is often preferred
- Debug NaN in training: systematically check for unprotected exp(), log(0), ill-conditioned systems, and gradient underflow