Information Theory

Chapter 2: Mathematical Foundations

I avoided information theory for longer than I should have. Every time I saw "entropy" in a paper, my brain filed it under "physics thing, skip for now." Every time I saw a KL divergence term in a loss function, I'd copy the code, nod politely, and move on without asking what it actually meant. Finally the discomfort grew too great — I kept bumping into these ideas in loss functions, in VAE papers, in decision tree internals, in RL objectives — and I realized information theory wasn't some side topic. It was the language my tools had been speaking all along. Here is that dive.

Information theory was born in 1948 when Claude Shannon published "A Mathematical Theory of Communication." He was trying to solve a practical engineering problem: how do you transmit messages reliably over a noisy wire? In the process, he gave us a mathematical framework for quantifying uncertainty, surprise, and the cost of being wrong. It turns out those concepts are exactly what machine learning needs.

Before we start, a heads-up. We're going to be working with logarithms, probability distributions, and a few summation formulas. You don't need to know any of it beforehand. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

Contents

Surprise and Entropy
    Why Logarithms?
    Differential Entropy
    Conditional and Joint Entropy
    Entropy in the Wild
Cross-Entropy: Measuring How Wrong Your Model Is
KL Divergence: The Exact Cost of Being Wrong
    Forward vs Reverse KL
    Jensen-Shannon Divergence
Rest Stop
Mutual Information: What Do These Variables Share?
    Feature Selection That Actually Works
    Contrastive Learning and InfoNCE
The Data Processing Inequality
The Information Bottleneck
Fisher Information
The Bigger Picture: Why ML Speaks Information Theory

Surprise and Entropy

Let's start with something concrete. Imagine you're building a tiny weather app for a city where it's sunny 364 days a year. Every morning you predict "sunny." When you're right — which is almost always — there's no surprise. It was sunny yesterday, sunny the day before, sunny the day before that. But on that one rainy day? That's a genuine shock. That one outcome carries a lot of information, precisely because it was so unlikely.

This is the core insight that everything else builds on: unlikely events carry more information than likely ones. Shannon formalized "surprise" for a single event with probability p as:

surprise(x) = -log₂(p(x))

When p = 1 (certainty), surprise is zero — you already knew what would happen. When p is tiny, surprise is large. The negative sign is there because log of a number between 0 and 1 is negative, and we want surprise to be positive.
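
To make this concrete, here's a two-line check using the 364-sunny-days city from above:

import numpy as np

p_sunny, p_rainy = 364/365, 1/365

def surprise(p):
    return -np.log2(p)

print(f"Sunny day: {surprise(p_sunny):.4f} bits")  # ~0.004 bits, no news at all
print(f"Rainy day: {surprise(p_rainy):.3f} bits")  # ~8.5 bits, a genuine shock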

Now, entropy is the average surprise across all possible outcomes, weighted by how likely each one is:

H(X) = -Σ p(xᵢ) log₂ p(xᵢ)

Back to our weather app. If this city has sunshine 99% of the time and rain 1% of the time, the entropy is about 0.08 bits. Almost no surprise on average — you can predict the weather with barely any information. A city where it's sunny half the time and rainy half the time has entropy of 1 bit — maximum uncertainty for two outcomes. And a city with three equally likely weather states (sunny, rainy, cloudy) hits about 1.58 bits. More possible outcomes, more uncertainty, more bits needed to describe what happened.

The pattern: entropy is maximized when all outcomes are equally likely and minimized when one outcome is certain. Think of it like a filing cabinet. A cabinet where every folder is in perfect order — one labeled "A," and everything goes in it — has zero entropy. A cabinet where documents are scattered randomly across dozens of folders? High entropy. You need a lot of information to describe where anything is.

Why Logarithms?

I'll be honest — the log base was the thing that tripped me up for years. Why log at all? Why not some other function that's big for small numbers and small for big numbers?

The reason is that logarithms make information additive. If you flip two independent fair coins, you want the total information to be 1 bit + 1 bit = 2 bits, not some weird product. Logarithms are essentially the only continuous functions that turn multiplication into addition, and since joint probabilities of independent events multiply (P(A and B) = P(A) × P(B)), logs turn that product into a sum. That's it. That's the whole reason.
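
A quick sanity check of that additivity with two independent fair coin flips:

import numpy as np

one_flip = -np.log2(0.5)           # information in a single fair coin flip
two_flips = -np.log2(0.5 * 0.5)    # probabilities of independent flips multiply...
print(one_flip, two_flips)         # 1.0 2.0 ...and the information adds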

As for the base — log base 2 gives you bits, log base e (natural log) gives you nats, log base 10 gives you hartleys. In ML we usually see nats because natural logs play nicely with calculus and gradients. The choice of base is a unit conversion, like Celsius vs Fahrenheit. The underlying physics doesn't change.

Differential Entropy

When your random variable is continuous, you replace the sum with an integral: h(X) = -∫ p(x) log p(x) dx. A Gaussian with variance σ² has differential entropy ½ log(2πeσ²) — wider Gaussians have higher entropy, which makes sense: more spread means more uncertainty about where a sample will land.

One oddity: unlike discrete entropy, differential entropy can be negative. A very peaked distribution concentrates so much probability in a tiny region that the average surprise dips below zero. This tripped me up the first time I computed it, but it rarely causes practical problems. It's a quirk of the continuous formulation, not a sign that something went wrong.
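
Here's a small numerical check of the Gaussian formula, including how easily it goes negative:

import numpy as np

def gaussian_diff_entropy(sigma):
    # h = 0.5 * log(2 * pi * e * sigma^2), in nats
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

for sigma in [1.0, 0.1, 0.01]:
    print(f"sigma = {sigma:5.2f}: h = {gaussian_diff_entropy(sigma):+.3f} nats")
# sigma = 1.00: +1.419 nats
# sigma = 0.10: -0.884 nats  (peaked enough that entropy goes negative)
# sigma = 0.01: -3.186 nats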

Conditional and Joint Entropy

Back to our weather app. Suppose we add a barometer reading to our system. Conditional entropy H(weather | barometer) measures how much uncertainty remains about the weather after you've read the barometer. If the barometer perfectly predicts the weather, conditional entropy is zero — no leftover surprise. If the barometer is broken and gives random readings, knowing the barometer tells you nothing, so H(weather | barometer) = H(weather).

Joint entropy H(weather, barometer) captures the total uncertainty in both variables together. The chain rule connects them: H(X, Y) = H(X) + H(Y|X). The total uncertainty equals the uncertainty in X plus whatever uncertainty remains in Y once you know X. This decomposition isn't decoration — it's the engine behind how decision trees choose splits.
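
To see the chain rule hold numerically, here's a toy joint distribution for the weather-and-barometer setup (the numbers are made up for illustration):

import numpy as np

# Made-up joint distribution: rows = barometer (low, high), columns = weather (sunny, rainy)
joint = np.array([[0.05, 0.25],
                  [0.60, 0.10]])

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_barometer = joint.sum(axis=1)  # marginal distribution of the barometer
# H(weather | barometer) = sum over readings b of p(b) * H(weather | B = b)
H_cond = sum(p_b * H(row / p_b) for p_b, row in zip(p_barometer, joint))

print(f"H(B) + H(W|B) = {H(p_barometer) + H_cond:.4f} bits")
print(f"H(W, B)       = {H(joint.flatten()):.4f} bits")  # chain rule: identical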

Entropy in the Wild

Here's where entropy stops being abstract. In decision trees, the "information gain" at every split node is H(Y) - H(Y|X) — the reduction in target entropy when you split on feature X. ID3, C4.5, and their descendants are doing information theory at every single node. The feature that gives you the most information gain is the one that most reduces your uncertainty about the label. That's the one the tree picks.

Entropy also shows up as a regularizer in reinforcement learning. In Soft Actor-Critic (SAC), an entropy bonus is added to the policy's objective. A policy with higher entropy tries a wider variety of actions instead of collapsing prematurely to a single deterministic choice. The entropy term literally quantifies "how much is this policy exploring?" and the algorithm optimizes for a balance between reward and exploration.

import numpy as np

# Our weather app: three cities with different predictability
# City A: almost always sunny
city_a = np.array([0.99, 0.005, 0.005])
# City B: equal chance of anything
city_b = np.array([1/3, 1/3, 1/3])
# City C: usually sunny, sometimes rainy, rarely cloudy
city_c = np.array([0.7, 0.2, 0.1])

def entropy(probs):
    # Filter zeros to avoid log(0)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

for name, p in [("Predictable", city_a), ("Chaotic", city_b), ("Realistic", city_c)]:
    print(f"{name} city: H = {entropy(p):.3f} bits")
# Predictable city: H = 0.091 bits
# Chaotic city:     H = 1.585 bits
# Realistic city:   H = 1.157 bits

The predictable city barely needs any bits to describe its weather. The chaotic city needs the most. Our filing cabinet analogy holds: the messier the distribution, the more information you need to pin down the outcome.

Cross-Entropy: Measuring How Wrong Your Model Is

Entropy tells us about a distribution's inherent uncertainty. But in ML we care about something different: we have the true distribution P (the real data) and a model distribution Q (our neural network's predictions). How do we measure how wrong Q is?

Imagine our weather app predicts sunny 90%, rainy 8%, cloudy 2%. But the real weather is sunny 70%, rainy 20%, cloudy 10%. Our model is overconfident about sunshine and underestimates rain. Cross-entropy quantifies this mismatch:

H(P, Q) = -Σ p(x) log q(x)

Think of it this way. Entropy H(P) tells you the minimum average bits needed to encode the weather if you had a perfect model. Cross-entropy H(P, Q) tells you how many bits you actually need when using your imperfect model Q. The difference is waste — bits spent on being wrong.

If Q perfectly matches P, cross-entropy equals entropy — you're using the optimal encoding. If Q is a bad approximation, cross-entropy is higher. You're like a tourist using a phrasebook for the wrong country. You can sort of communicate, but you're spending way more effort than a local would.

Here's the thing that clicked for me: when you call nn.CrossEntropyLoss() in PyTorch, you're computing exactly this. (PyTorch takes raw logits and integer class labels and applies the log-softmax internally, but the math is the same.) The one-hot label is your P (the truth), the softmax output is your Q (the model's belief). The loss measures how many extra bits your model wastes because it doesn't perfectly match reality. For binary classification, it reduces to the formula you've seen everywhere: -[y log(q) + (1-y) log(1-q)]. That formula? Pure information theory.

And here's a connection that took me embarrassingly long to internalize: minimizing cross-entropy loss is equivalent to maximum likelihood estimation. Both are asking "make Q match P as closely as possible." They arrive from different directions — one from information theory, one from statistics — and meet at exactly the same optimization problem.
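
Here's the collapse in miniature. With a one-hot P, cross-entropy reduces to the negative log-likelihood of the true class, which is exactly what maximum likelihood minimizes (the softmax output below is made up):

import numpy as np

p_true = np.array([0.0, 1.0, 0.0])    # one-hot label: class 1 is the truth
q_model = np.array([0.2, 0.7, 0.1])   # model's softmax output (made up)

cross_entropy = -np.sum(p_true * np.log(q_model))  # H(P, Q)
neg_log_likelihood = -np.log(q_model[1])           # what MLE minimizes

print(cross_entropy, neg_log_likelihood)  # both 0.3567 nats: the same quantity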

KL Divergence: The Exact Cost of Being Wrong

Cross-entropy bundles two things together: the inherent uncertainty in the data (which you can't reduce) and the penalty for being wrong (which you can). KL divergence (Kullback-Leibler divergence) isolates the second part — the pure overhead of using Q instead of P:

D_KL(P ‖ Q) = Σ p(x) log [p(x) / q(x)]

The fundamental relationship tying these three quantities together is: H(P, Q) = H(P) + D_KL(P ‖ Q). Cross-entropy equals the true entropy plus the KL divergence. Since H(P) is a constant — the real data distribution doesn't change during training — minimizing cross-entropy is mathematically identical to minimizing KL divergence. This is not a coincidence or an approximation. It's an identity. It's why cross-entropy loss works.

KL divergence is always non-negative. You can never do better than the true distribution's own entropy. This fact, called Gibbs' inequality, means the best possible cross-entropy loss is H(P) — and you achieve it only when Q = P exactly.
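
You can watch Gibbs' inequality hold empirically. Here's a sketch that throws thousands of random models Q at a fixed P and never finds a negative divergence:

import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.7, 0.2, 0.1])

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Random candidate models: KL is always >= 0, and 0 only at Q = P
divs = [kl(P, rng.dirichlet(np.ones(3))) for _ in range(10_000)]
print(f"min KL over 10,000 random Q's: {min(divs):.4f} nats")  # strictly positive
print(f"KL at Q = P exactly:           {kl(P, P):.4f} nats")   # zero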

Forward vs Reverse KL

Here's where it gets subtle, and I'll be honest — I mixed forward and reverse KL up repeatedly before the distinction finally stuck.

KL divergence is asymmetric: D_KL(P ‖ Q) ≠ D_KL(Q ‖ P). This isn't a minor technicality. The direction you choose fundamentally changes your model's behavior.

Forward KL, D_KL(P ‖ Q), penalizes Q heavily wherever P has mass but Q doesn't. The model thinks "I must cover everything the data might produce." The result is mode-covering — Q spreads out to blanket all of P, even if it means assigning probability to regions between modes where P has little mass. This is what maximum likelihood estimation does.

Reverse KL, D_KL(Q ‖ P), penalizes Q for placing mass where P is near zero. The model thinks "I must not put probability anywhere the data wouldn't go." The result is mode-seeking — Q locks onto one mode of P and ignores the others. This is what variational inference does.

Picture a target distribution P that has two peaks — say, two clusters of weather patterns. If Q can only be a single Gaussian, forward KL will stretch Q across both peaks (covering both but fitting neither well). Reverse KL will collapse Q tightly onto one peak (fitting that peak well but completely ignoring the other). Neither is "right" — it depends on whether missing a mode (reverse) or being imprecise everywhere (forward) is worse for your application.

This is why variational inference tends to produce posteriors that are too narrow. It uses reverse KL. The approximate posterior would rather be confidently wrong about some modes than risk placing probability where the true posterior is low.

Let's see this concretely with our weather app:

import numpy as np

# True weather distribution: two equally likely dominant modes
P = np.array([0.49, 0.49, 0.02])  # sunny, rainy, cloudy

# Model A: spreads probability broadly (mode-covering behavior)
Q_cover = np.array([1/3, 1/3, 1/3])

# Model B: locks onto one dominant mode (mode-seeking behavior)
Q_seek = np.array([0.9, 0.05, 0.05])

def kl_divergence(p, q):
    # Only where p > 0
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(f"Forward KL(P || Q_cover): {kl_divergence(P, Q_cover):.4f} nats")
print(f"Forward KL(P || Q_seek):  {kl_divergence(P, Q_seek):.4f} nats")
print(f"Reverse KL(Q_cover || P): {kl_divergence(Q_cover, P):.4f} nats")
print(f"Reverse KL(Q_seek || P):  {kl_divergence(Q_seek, P):.4f} nats")
# Forward KL prefers Q_cover (0.32 < 0.80 nats); reverse KL prefers
# Q_seek (0.48 < 0.68 nats). The two directions rank the models differently.

KL divergence is not a distance metric — it violates both symmetry and the triangle inequality. Say "KL divergence from P to Q," never "KL distance between P and Q." The direction encodes a design decision about your model's behavior.

Where does KL divergence show up in practice? Everywhere. In VAEs, the ELBO loss includes a KL term pushing the encoder's approximate posterior toward a standard Gaussian prior. In knowledge distillation, you minimize KL divergence between the teacher's soft outputs and the student's predictions. In PPO and TRPO, a KL penalty prevents the new policy from straying too far from the old one, keeping training stable. Each of these uses KL divergence as a leash — "don't let Q wander too far from the reference."

Jensen-Shannon Divergence

The asymmetry of KL divergence is sometimes a problem. Jensen-Shannon divergence (JSD) fixes it by averaging both directions through a midpoint distribution M = ½(P + Q):

JSD(P ‖ Q) = ½ D_KL(P ‖ M) + ½ D_KL(Q ‖ M)

JSD is symmetric, always bounded between 0 and 1 (with log base 2), and its square root is a proper metric. The original GAN's discriminator loss implicitly estimates JSD between real and generated data. That's elegant, but there's a practical problem: when the real and generated distributions don't overlap (common early in training), JSD saturates and provides no useful gradient. This is exactly why Wasserstein GANs were invented — the Wasserstein distance provides meaningful gradients even when distributions are disjoint.

Our phrasebook analogy from earlier extends nicely here. KL divergence is like measuring how much harder it is for you (speaking model-Q-language) to communicate in country-P. JSD is like sitting both speakers at a table and measuring the total miscommunication in both directions. Wasserstein distance asks a different question entirely: how much physical effort would it take to rearrange one language into the other?
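
Here's the saturation problem in miniature: a point-mass P and two disjoint models, one nearby and one far away. JSD can't tell them apart, which is exactly the missing-gradient problem:

import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Point masses on a line of 10 bins: P at bin 0, Q either adjacent or far away
P = np.zeros(10); P[0] = 1.0
Q_near = np.zeros(10); Q_near[1] = 1.0
Q_far = np.zeros(10); Q_far[9] = 1.0

print(jsd(P, Q_near), jsd(P, Q_far))  # both exactly 1.0 bit: no gradient signal

Wasserstein distance would report the two cases differently (one bin of "earth" to move versus nine), which is why it rescues training when the distributions don't overlap.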

Rest Stop

Congratulations on making it this far. You can stop here if you want.

You now have a mental model that covers the three most important ideas in information theory for ML: entropy measures uncertainty, cross-entropy measures how wrong your model is, and KL divergence isolates the pure cost of being wrong. You know that every time you train with cross-entropy loss, you're minimizing KL divergence from the true data distribution. You know the difference between forward and reverse KL and why it matters for variational inference. That's a genuinely useful toolkit.

It doesn't tell the whole story, though. We haven't talked about what to do when you want to measure the relationship between two variables (not just compare two distributions). We haven't talked about the fundamental limits of what any model can learn from processed data. And we haven't talked about why deep networks seem to naturally compress their representations.

The short version: mutual information captures shared knowledge between variables, the data processing inequality says you can never create information by transforming data, and the information bottleneck suggests that deep learning is fundamentally about learning the right compression. That's 80% of the story.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

Mutual Information: What Do These Variables Share?

So far we've been comparing a model Q to the truth P. But sometimes the question is different. You have two variables — say, barometer readings and tomorrow's weather — and you want to know: how much does one tell you about the other?

Mutual information I(X; Y) answers exactly this: how much does knowing X reduce your uncertainty about Y?

I(X; Y) = H(Y) - H(Y|X)

That's it. Start with your uncertainty about Y. Subtract the uncertainty that remains after observing X. The difference is the information that X provides about Y. There are equivalent ways to write this: I(X; Y) = H(X) + H(Y) - H(X, Y). The symmetry is real and beautiful: X tells you as much about Y as Y tells you about X.

If X and Y are independent, I(X; Y) = 0 — knowing one tells you nothing about the other. A broken barometer has zero mutual information with the weather. If X completely determines Y, then I(X; Y) = H(Y) — all uncertainty is eliminated. A perfect barometer has mutual information equal to the weather's entropy.

The Venn diagram picture is the one that finally made it click for me. Draw two overlapping circles: H(X) and H(Y). The overlap is mutual information — the shared uncertainty. The total area of both circles is joint entropy H(X, Y). The part of Y's circle that doesn't overlap with X is the conditional entropy H(Y|X). Every information-theoretic relationship becomes a statement about areas of those circles.
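
Here's the barometer picture as numbers, reusing a made-up joint distribution and verifying that the two formulas above agree:

import numpy as np

# Made-up joint: rows = barometer (low, high), columns = weather (sunny, rainy)
joint = np.array([[0.05, 0.25],
                  [0.60, 0.10]])

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = joint.sum(axis=1)  # barometer marginal
p_y = joint.sum(axis=0)  # weather marginal

# Definition 1: I(X; Y) = H(Y) - H(Y|X)
H_y_given_x = sum(p * H(row / p) for p, row in zip(p_x, joint))
mi_1 = H(p_y) - H_y_given_x

# Definition 2: I(X; Y) = H(X) + H(Y) - H(X, Y)
mi_2 = H(p_x) + H(p_y) - H(joint.flatten())

print(f"{mi_1:.4f} bits == {mi_2:.4f} bits")  # the overlap of the two circles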

Feature Selection That Actually Works

Here's why mutual information matters for practical ML work. Pearson correlation measures linear relationships. If Y = X², and X is symmetric around zero, Pearson correlation between X and Y is exactly zero — it sees no relationship at all. Mutual information sees the full picture because it captures any statistical dependency, linear or not.

Scikit-learn's mutual_info_classif and mutual_info_regression exploit this directly. They estimate mutual information between each feature and the target, giving you a feature importance score that works for nonlinear relationships where correlation fails silently.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Feature X1: linearly related to target
# Feature X2: nonlinearly related (enters the target only through X2**2)
# Feature X3: pure noise
np.random.seed(42)
n = 1000
X1 = np.random.randn(n)
X2 = np.random.randn(n)
X3 = np.random.randn(n)

# Target depends on X1 linearly and on X2 only through X2**2 (symmetric in X2,
# so the correlation between X2 and y is ~0 even though X2 is informative)
y = ((X1 + X2**2 - 1) > 0).astype(int)

X = np.column_stack([X1, X2, X3])
mi = mutual_info_classif(X, y, random_state=42)

# Pearson correlation for comparison
corr = [np.abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(3)]

print("Feature | Mutual Info | |Correlation|")
for i, name in enumerate(["Linear", "Nonlinear", "Noise"]):
    print(f"  {name:10s} | {mi[i]:.3f}       | {corr[i]:.3f}")
# MI catches the nonlinear feature. Correlation misses it.

The estimation is the hard part. Computing mutual information exactly requires knowing the true joint distribution, which we never have. In practice, k-nearest-neighbor estimators (the Kraskov method) approximate it from samples. These work reasonably well in low dimensions but struggle as dimensionality grows — the curse of dimensionality hits MI estimation hard.

Contrastive Learning and InfoNCE

Modern contrastive learning (SimCLR, CPC, CLIP) has an elegant information-theoretic interpretation. The InfoNCE loss, introduced by van den Oord et al. in 2018, is a lower bound on mutual information between two "views" of the same data point. When you train CLIP to align image and text embeddings, you're maximizing mutual information between visual and linguistic representations of the same concept.
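
Here's a minimal sketch of the InfoNCE computation, assuming a batch of paired embeddings where z_a[i] and z_b[i] are two views of the same example (all the data below is synthetic):

import numpy as np

rng = np.random.default_rng(0)
batch, dim, temperature = 8, 16, 0.1

# Two "views" of the same examples: z_b is a lightly perturbed copy of z_a
z_a = rng.standard_normal((batch, dim))
z_b = z_a + 0.1 * rng.standard_normal((batch, dim))

# Normalize, then cosine similarity between every (view-A, view-B) pair
z_a /= np.linalg.norm(z_a, axis=1, keepdims=True)
z_b /= np.linalg.norm(z_b, axis=1, keepdims=True)
logits = (z_a @ z_b.T) / temperature

# InfoNCE = cross-entropy where the matching pair (the diagonal) is the true class
log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.diag(log_softmax).mean()

# The bound from the paper: I(z_a; z_b) >= log(batch) - loss
print(f"InfoNCE loss: {loss:.4f} nats, MI lower bound: {np.log(batch) - loss:.4f} nats")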

In InfoGAN, the generator maximizes mutual information between a subset of latent codes and the generated output. The result: those latent codes learn to control interpretable, disentangled factors — rotation, thickness, style — because high mutual information means the code actually determines the variation.

The Data Processing Inequality

Here's one of the most profound results in information theory, and it fits in one line. If X → Y → Z form a chain (Z depends on X only through Y), then:

I(X; Z) ≤ I(X; Y)

Processing data can only destroy information, never create it. Every transformation, every function, every layer in your neural network can only lose information about the original input. If your preprocessing step throws away some detail from the raw signal, no model downstream can recover it. Ever.
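
You can watch the inequality hold with a toy Markov chain of noisy binary channels; the 10% flip rates below are arbitrary:

import numpy as np

def mutual_info(joint):
    # I(X; Y) from a 2x2 joint probability table, in bits
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    return np.sum(joint * np.log2(joint / (px * py)))

def bsc(flip):
    # Binary symmetric channel: flips the bit with probability `flip`
    return np.array([[1 - flip, flip], [flip, 1 - flip]])

p_x = np.array([0.5, 0.5])           # X: a fair coin
T1, T2 = bsc(0.1), bsc(0.1)          # X -> Y and Y -> Z each flip 10% of bits

joint_xy = p_x[:, None] * T1         # p(x, y) = p(x) p(y|x)
joint_xz = p_x[:, None] * (T1 @ T2)  # p(x, z) = p(x) sum_y p(y|x) p(z|y)

print(f"I(X; Y) = {mutual_info(joint_xy):.4f} bits")
print(f"I(X; Z) = {mutual_info(joint_xz):.4f} bits  <= I(X; Y), always")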

This sounds bleak, but it's actually the key insight that makes deep learning make sense. A good neural network isn't trying to preserve all information. It's trying to throw away the right information. A photo of a dog contains information about the breed, the lighting, the camera sensor noise, the JPEG compression artifacts, and the precise shade of the background wall. A breed classifier should keep the breed information and aggressively discard everything else.

Think of our filing cabinet again. You have a massive cabinet full of every document ever generated (the raw input). Your job isn't to build a bigger cabinet — it's to build a smaller one that keeps the documents you need for your task and shreds the rest. Each layer of a neural network is a clerk deciding what to keep and what to shred. The data processing inequality says the clerk can never un-shred a document.

This has a practical consequence that's easy to overlook: feature engineering sets the ceiling. If you feed noisy, poorly processed data into a pipeline, the data processing inequality guarantees that no model — no matter how sophisticated — can exceed the information present in that noisy input. Cleaning data and extracting good features up front raises the ceiling for everything downstream. The fanciest architecture in the world can't compensate for information that was destroyed before it arrived.

Skip connections in ResNets make intuitive sense through this lens. By providing a direct path from earlier layers to later ones, skip connections let information bypass potentially destructive transformations. They raise the information floor of deep networks.

The Information Bottleneck

Naftali Tishby and colleagues formalized a beautiful idea: the goal of a learned representation is to compress the input while preserving what matters about the target. Given input X, representation T, and target Y, the information bottleneck objective minimizes:

I(X; T) - β · I(T; Y)

The first term says "compress — reduce how much T remembers about X." The second says "but keep the parts that predict Y." The parameter β controls the tradeoff. High β means "prioritize prediction accuracy." Low β means "prioritize compression."

In 2017, Shwartz-Ziv and Tishby proposed that deep network training has two distinct phases, visible on an "information plane" that plots I(T; X) vs I(T; Y) for each layer across training. First a fitting phase: mutual information with the target Y increases as the network learns to predict. Then a compression phase: mutual information with the input X decreases as the network learns to forget irrelevant details.

I'm still developing my intuition for whether this compression phase is truly universal. The evidence is compelling for networks with saturating activations like tanh, where gradient noise naturally drives compression. For ReLU networks, the story is murkier — some studies find compression, others don't. The information bottleneck remains one of the most beautiful theoretical lenses for understanding deep learning, but the debate about its universality is genuine and ongoing.

What's not debatable is the principle itself: good representations compress. An embedding that preserves everything about the input is useless (it's the input). An embedding that throws away everything is also useless (it's a constant). The sweet spot — retaining what matters for the task and discarding the rest — is exactly what the information bottleneck formalizes.

Fisher Information

Everything we've discussed so far — entropy, cross-entropy, KL divergence, mutual information — measures properties of distributions. Fisher information asks a different question: how sensitive is a distribution to changes in its parameters?

I(θ) = E[(∂/∂θ log p(x; θ))²]

This is the expected squared gradient of the log-likelihood. Equivalently, it's the negative expected second derivative — the curvature of the log-likelihood surface. High Fisher information at a parameter value θ means the likelihood landscape has sharp curvature there: the data is very informative about θ, and small changes in θ produce large changes in how the data looks.

I haven't figured out a great way to visualize Fisher information, but here's a crude attempt. Imagine you're standing at a point on a hilly landscape. Fisher information tells you how steep the hills are around you. If the terrain is sharply curved (high Fisher information), you can pinpoint your position precisely — a small step in any direction puts you at a noticeably different altitude. If the terrain is flat (low Fisher information), you could wander a long way without the view changing — the data can't tell you much about exactly where you are.

The Cramér-Rao bound makes this precise: no unbiased estimator of θ can have variance lower than 1/I(θ). More Fisher information means tighter bounds on how accurately you can estimate a parameter. This is the fundamental limit on learning from data — not a weakness of your algorithm, but a property of the problem itself.
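
A sketch with a coin: for a Bernoulli(θ), the Fisher information of one observation works out to I(θ) = 1/(θ(1-θ)), and the sample mean's variance lands right on the Cramér-Rao bound 1/(n·I(θ)):

import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 0.3, 1_000, 20_000

# Fisher information of a single Bernoulli(theta) observation
fisher = 1 / (theta * (1 - theta))
cramer_rao = 1 / (n * fisher)   # variance floor for any unbiased estimator

# Empirical variance of the MLE (the sample mean) across many simulated datasets
estimates = rng.binomial(n, theta, size=trials) / n
print(f"Cramér-Rao bound:   {cramer_rao:.2e}")
print(f"Empirical variance: {estimates.var():.2e}")  # the sample mean achieves the bound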

In deep learning, Fisher information powers natural gradient descent. Ordinary gradient descent treats parameter space as flat — a step of size 0.01 means the same thing everywhere. But parameter space isn't flat. Nudging a weight in one layer might barely affect the output, while the same nudge in another layer could be catastrophic. The Fisher information matrix captures this curved geometry. Natural gradient descent adjusts the gradient by the inverse Fisher information matrix, so step sizes are measured in terms of how much the model's output distribution changes, not how much the raw parameters move. This is the idea behind K-FAC and similar second-order optimizers.

Elastic Weight Consolidation (EWC) takes this further. When training on task B after task A, EWC uses the diagonal of the Fisher information matrix from task A to identify which parameters were "important" — high Fisher information means the likelihood was very sensitive to that parameter. Those important parameters get penalized for changing during task B training: L_EWC = L_B + (λ/2) Σ Fᵢ(θᵢ - θ*ᵢ)². Parameters the old task didn't care about (low Fᵢ) are free to change. Parameters the old task depended on (high Fᵢ) are held in place. It's a principled solution to catastrophic forgetting, grounded in information geometry.
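
The penalty itself is a one-liner. A minimal sketch, assuming you already have the diagonal Fisher estimates F and the task-A optimum theta_star (both invented here):

import numpy as np

rng = np.random.default_rng(0)
n_params, lam = 5, 100.0

theta_star = rng.standard_normal(n_params)   # parameters at the end of task A
F = np.array([5.0, 0.01, 3.0, 0.02, 4.0])    # made-up diagonal Fisher from task A
theta = theta_star + 0.5                     # parameters drifting during task B

# EWC penalty: important parameters (high F) pay dearly for moving
ewc_penalty = 0.5 * lam * np.sum(F * (theta - theta_star) ** 2)
print(f"EWC penalty: {ewc_penalty:.2f}")  # dominated by the high-Fisher parameters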

The Bigger Picture: Why ML Speaks Information Theory

If you're still with me, thank you. I hope it was worth it.

We started with surprise — the observation that unlikely events carry more information than likely ones. From there we built entropy (average surprise), cross-entropy (the cost of a wrong model), and KL divergence (the pure waste from being wrong). We discovered that the loss function you use every day is an information-theoretic quantity. We moved on to mutual information (shared knowledge between variables), the data processing inequality (you can never un-destroy information), the information bottleneck (good representations compress intelligently), and Fisher information (how sensitive your model is to parameter changes).

These aren't separate tools that happen to share a mathematical heritage. They're different faces of a single insight: learning is compression. A model that generalizes well has found a compact description of the patterns in data — keeping the signal, discarding the noise.

The Minimum Description Length (MDL) principle makes this explicit: the best model is the one that compresses the data most efficiently. The total cost is bits for the model description plus bits for the data given the model. A complex model (many parameters) is expensive to describe but encodes the data cheaply. A simple model is cheap to describe but may encode data poorly. MDL says minimize the sum. This is the bias-variance tradeoff, restated in bits. L1/L2 regularization, dropout, pruning — they're all reducing the model's description length.

Variational inference is KL divergence turned into a computational tool. Find the distribution q(z) that minimizes D_KL(q(z) ‖ p(z|x)) — make your approximate posterior as close as possible to the true posterior. Since the true posterior is intractable, we maximize the Evidence Lower Bound (ELBO) instead, which is mathematically equivalent. This powers VAEs. Normalizing flows and diffusion models can be understood as making q(z) more flexible to reduce this KL gap.

Recent theoretical work even uses mutual information to bound generalization error. The intuition: if a learning algorithm's output has high mutual information with the specific training set it saw, it has memorized training-specific noise and will generalize poorly. Data augmentation, dropout, and noise injection all work in part because they reduce this memorization, lowering the mutual information between the learned model and the particular training examples.

My hope is that the next time you see a KL divergence term in a loss function, or hear someone mention "information gain" in a decision tree, or wonder why your cross-entropy loss plateaus at some non-zero value, instead of nodding politely and moving on, you'll recognize the information-theoretic machinery underneath — and have a pretty solid mental model of what it's doing and why.

import numpy as np
from scipy.stats import entropy as sp_entropy

def entropy_bits(probs):
    """Entropy in bits."""
    return sp_entropy(probs, base=2)

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in nats."""
    return -np.sum(p * np.log(q))

def kl_div(p, q):
    """KL divergence D_KL(P || Q) in nats."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence in nats."""
    m = 0.5 * (p + q)
    return 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)

def info_gain(parent, left, right):
    """Information gain for a binary split (decision tree)."""
    n = len(parent)
    h_parent = entropy_bits(np.bincount(parent) / n)
    h_left = entropy_bits(np.bincount(left) / len(left))
    h_right = entropy_bits(np.bincount(right) / len(right))
    return h_parent - (len(left)/n * h_left + len(right)/n * h_right)

# --- Demo: the full information theory toolkit ---

P = np.array([0.7, 0.2, 0.1])    # true weather distribution
Q = np.array([0.5, 0.3, 0.2])    # model's prediction

print("=== Core Quantities ===")
print(f"Entropy H(P):           {sp_entropy(P):.4f} nats")
print(f"Cross-entropy H(P,Q):   {cross_entropy(P, Q):.4f} nats")
print(f"KL divergence D(P||Q):  {kl_div(P, Q):.4f} nats")
print(f"Verify H(P,Q) = H(P) + KL: {sp_entropy(P) + kl_div(P, Q):.4f} nats")
print(f"JSD(P, Q):              {jsd(P, Q):.4f} nats")

print("\n=== Decision Tree Split ===")
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1])  # 5 sunny, 3 rainy
left = np.array([1, 1, 1, 0])   # high-humidity group
right = np.array([0, 0, 0, 0])  # low-humidity group
print(f"Information gain: {info_gain(labels, left, right):.3f} bits")