Information Theory
“Entropy, cross-entropy, KL divergence — the math behind why loss functions work”
Entropy, cross-entropy loss, KL divergence, and mutual information: the mathematical backbone behind why cross-entropy works as a loss function, how the KL term in a VAE regularizes its latent space, and how decision trees choose splits by information gain.
Key Formulas
Entropy
Average 'surprise' in bits: maximal when all outcomes are equally likely, zero when the outcome is deterministic
Cross-Entropy Loss
Expected number of bits needed to encode samples from p using a code designed for q; this is the standard classification loss
KL Divergence
Extra bits needed to encode p with a code optimized for q. Always ≥ 0, equals 0 iff p=q
Mutual Information
How much knowing Y reduces uncertainty about X — used in feature selection and representation learning
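For reference, the standard definitions of these four quantities for discrete distributions p and q over the same outcomes are sketched below. Logarithms are taken in base 2 so the results are in bits (natural logs give nats instead); the symbols x, y, and p(x, y) are notation assumed here for outcomes and the joint distribution.

```latex
\begin{align}
H(p) &= -\sum_x p(x)\,\log_2 p(x) \\
H(p, q) &= -\sum_x p(x)\,\log_2 q(x) \\
D_{\mathrm{KL}}(p \,\|\, q) &= \sum_x p(x)\,\log_2 \frac{p(x)}{q(x)} \\
I(X; Y) &= \sum_{x,y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)} = H(X) - H(X \mid Y)
\end{align}
```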
Why Information Theory Underpins ML Loss Functions
When you train a classifier with cross-entropy loss, you are minimizing the average number of bits needed to communicate the ground-truth labels using a code built from the model's predicted distribution. When a VAE maximizes the ELBO (equivalently, minimizes the negative ELBO as its loss), the regularization term is a KL divergence between the learned latent distribution and a prior. When you score a decision tree split by information gain, you are computing the reduction in entropy of the labels. The connection to information theory is not an accident: it provides a principled, unified framework for understanding why these seemingly ad hoc loss functions are well-motivated choices for their respective goals.
Cross-entropy decomposes as H(p,q) = H(p) + KL(p‖q). Since H(p) depends only on the data and not on the model, minimizing cross-entropy over q is the same as minimizing KL(p‖q), the divergence of the model q from the true distribution p.
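The decomposition can be checked numerically. The following is a minimal sketch, assuming two made-up distributions `p` and `q` over three classes; the helper names (`entropy_bits`, `cross_entropy_bits`, `kl_bits`) are illustrative and not part of any library.

```python
import numpy as np

def entropy_bits(p):
    """H(p) in bits, skipping zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def cross_entropy_bits(p, q):
    """H(p, q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(q[nz])))

def kl_bits(p, q):
    """KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

p = np.array([0.7, 0.2, 0.1])    # "true" distribution (illustrative values)
q = np.array([0.5, 0.3, 0.2])    # one model prediction (illustrative values)
q2 = np.array([0.6, 0.3, 0.1])   # a second model prediction

lhs = cross_entropy_bits(p, q)
rhs = entropy_bits(p) + kl_bits(p, q)
print(f"H(p,q)           = {lhs:.6f} bits")
print(f"H(p) + KL(p||q)  = {rhs:.6f} bits")

# Changing the model changes cross-entropy and KL by the same amount,
# because H(p) does not depend on the model.
delta_ce = cross_entropy_bits(p, q2) - cross_entropy_bits(p, q)
delta_kl = kl_bits(p, q2) - kl_bits(p, q)
print(f"change in CE = {delta_ce:.6f}, change in KL = {delta_kl:.6f}")
```

Because the two changes agree, ranking models by cross-entropy gives the same order as ranking them by KL(p‖q).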
Entropy: Measuring Surprise
Think of entropy as the average surprise in a probability distribution. A fair coin (50/50) has entropy H = 1 bit: you gain exactly 1 bit of information on each flip. A biased coin (99/1) has entropy of only about 0.08 bits, because you are rarely surprised. A uniform distribution over 256 outcomes has entropy H = 8 bits, so you need 8 bits to describe each outcome. ML application: a well-calibrated model's predictions near a class boundary have high entropy (uncertain), and its predictions on clear examples have near-zero entropy (confident). Entropy-regularized RL (for example Soft Actor-Critic) maximizes expected reward plus policy entropy to encourage exploration.
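To make the link to model confidence concrete, here is a small sketch of predictive entropy computed from softmax outputs. The logits are invented for illustration, and `softmax` and `entropy_bits` are hypothetical helper names.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()          # shift so the largest logit is 0
    e = np.exp(z)
    return e / e.sum()

def entropy_bits(p):
    """Shannon entropy in bits, with 0 * log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

uniform_256 = np.full(256, 1 / 256)     # 256 equally likely outcomes
boundary = softmax([2.0, 1.9, -1.0])    # two classes nearly tied
clear = softmax([9.0, 0.0, -1.0])       # one class dominates

print(f"uniform over 256 outcomes: {entropy_bits(uniform_256):.4f} bits")
print(f"near a class boundary:     {entropy_bits(boundary):.4f} bits")
print(f"clear example:             {entropy_bits(clear):.4f} bits")
```

Ranking unlabeled examples by this predictive entropy is one common way to surface the inputs a model is least sure about.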
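Maximum entropy principle: given a set of constraints, choose the distribution that maximizes entropy subject to them. Among all distributions on the real line with a given mean and variance, this yields the Normal distribution, which makes it the least informative (fewest extra assumptions) choice under those constraints.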
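A quick way to see the Normal distribution's maximum-entropy property is to compare differential entropies of a few distributions that share the same mean and variance. This sketch uses SciPy's frozen distributions; the choice of Laplace and uniform as comparison cases is an assumption made for illustration, and their scale parameters are set so each has variance 1.

```python
import numpy as np
from scipy import stats

# Three zero-mean distributions, each parameterized to have variance 1.
candidates = {
    "normal": stats.norm(loc=0.0, scale=1.0),
    # Laplace variance is 2 * b^2, so b = 1/sqrt(2) gives variance 1.
    "laplace": stats.laplace(loc=0.0, scale=1.0 / np.sqrt(2.0)),
    # Uniform variance is width^2 / 12, so width = 2*sqrt(3) gives variance 1.
    "uniform": stats.uniform(loc=-np.sqrt(3.0), scale=2.0 * np.sqrt(3.0)),
}

entropies = {}
for name, dist in candidates.items():
    entropies[name] = float(dist.entropy())   # differential entropy in nats
    print(f"{name:8s} var={float(dist.var()):.3f}  h={entropies[name]:.4f} nats")
```

The normal distribution should come out with the largest entropy of the three, consistent with the principle.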
Entropy, Cross-Entropy & KL Divergence in Practice
```python
import numpy as np
from scipy.stats import entropy as scipy_entropy


def entropy(p: np.ndarray, base: float = 2) -> float:
    """Shannon entropy H(p) in bits (base=2) or nats (base=e)"""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log(0) = 0 by convention
    # Adding 0.0 turns a possible -0.0 into 0.0 so it prints cleanly
    return float(-np.sum(p * np.log(p) / np.log(base))) + 0.0


def cross_entropy(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """H(p, q) = -sum p * log(q), in nats (natural log)"""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q + eps))


def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p||q) in nats; NOT symmetric"""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log((p[mask] + eps) / (q[mask] + eps)))


# ── 1. Entropy of various distributions ──────────────────────────────
print("Entropy examples (bits):")
print(f"  Fair coin [0.5, 0.5]:      {entropy([0.5, 0.5]):.4f}")    # 1.0 bit
print(f"  Biased coin [0.99, 0.01]:  {entropy([0.99, 0.01]):.4f}")  # ≈ 0.08 bits
print(f"  Uniform 8 classes:         {entropy([1/8]*8):.4f}")       # 3.0 bits
print(f"  Certain [1.0, 0.0]:        {entropy([1.0, 0.0]):.4f}")    # 0.0 bits

# ── 2. Cross-entropy loss (classification) ───────────────────────────
# Ground truth (one-hot): cat
p_true = np.array([1., 0., 0.])  # cat

# Model predictions:
q_good = np.array([0.8, 0.1, 0.1])       # confident & correct
q_bad = np.array([0.1, 0.8, 0.1])        # confident & wrong
q_uncertain = np.array([0.4, 0.3, 0.3])  # uncertain & correct lean

print("\nCross-entropy losses:")
print(f"  Good prediction:  {cross_entropy(p_true, q_good):.4f}")  # low
print(f"  Bad prediction:   {cross_entropy(p_true, q_bad):.4f}")   # high
print(f"  Uncertain but ok: {cross_entropy(p_true, q_uncertain):.4f}")

# H(p,q) = H(p) + KL(p||q). Since H(p)=0 for one-hot: CE = KL(p||q)
print(f"  KL(p_true||q_good) = {kl_divergence(p_true, q_good):.4f}")

# ── 3. KL divergence: asymmetry ──────────────────────────────────────
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.3, 0.5, 0.2])
print(f"\nKL(p||q) = {kl_divergence(p, q):.4f}")
print(f"KL(q||p) = {kl_divergence(q, p):.4f}")  # different: not a distance


# ── 4. Information gain in decision trees ────────────────────────────
def information_gain(parent, left, right):
    """Entropy of the parent labels minus the size-weighted child entropies (bits)."""
    n = len(parent)
    n_l, n_r = len(left), len(right)
    h_p = scipy_entropy(np.bincount(parent) / n, base=2)
    h_l = scipy_entropy(np.bincount(left) / n_l, base=2) if n_l > 0 else 0
    h_r = scipy_entropy(np.bincount(right) / n_r, base=2) if n_r > 0 else 0
    return h_p - (n_l/n * h_l + n_r/n * h_r)


# 10 samples: 6 class-0, 4 class-1. Split: left=[0,0,0,0,1], right=[0,0,1,1,1]
parent = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
left = np.array([0, 0, 0, 0, 1])
right = np.array([0, 0, 1, 1, 1])
print(f"\nInformation gain: {information_gain(parent, left, right):.4f} bits")
```
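Mutual information is listed among the key formulas but is not exercised in the code above. The sketch below computes it from a joint table of counts or probabilities; `mutual_information` is a hypothetical helper, not a library function. The last case reuses the 10-sample decision-tree split (rows are left/right branch, columns are class 0/1), for which mutual information between branch and class equals the information gain.

```python
import numpy as np

def mutual_information(joint, base=2.0):
    """I(X;Y) from a 2-D table of joint counts or probabilities."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                 # normalize counts to probabilities
    px = joint.sum(axis=1, keepdims=True)       # marginal of X, shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)       # marginal of Y, shape (1, m)
    indep = px @ py                             # joint under independence
    nz = joint > 0                              # 0 * log(0) = 0 by convention
    return float(np.sum(joint[nz] * np.log(joint[nz] / indep[nz])) / np.log(base))

# Independent variables: the joint is the product of the marginals.
independent = np.outer([0.5, 0.5], [0.3, 0.7])

# Perfectly dependent fair bits: knowing X determines Y.
dependent = np.array([[0.5, 0.0],
                      [0.0, 0.5]])

# Decision-tree split from the example above, as counts:
# left branch has 4 class-0 and 1 class-1; right branch has 2 class-0 and 3 class-1.
split_counts = np.array([[4, 1],
                         [2, 3]])

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
mi_split = mutual_information(split_counts)

print(f"independent: {mi_independent:.4f} bits")
print(f"dependent:   {mi_dependent:.4f} bits")
print(f"tree split:  {mi_split:.4f} bits")
```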
KL Divergence in Modern ML
KL divergence appears throughout modern ML: (1) VAE loss = reconstruction loss + KL(q(z|x) ‖ p(z)), where the KL term regularizes the latent space toward the prior. (2) Policy-gradient RL: TRPO constrains the KL between the old and new policy to avoid catastrophic updates, and PPO approximates that trust region with a clipped objective or a KL penalty. (3) Knowledge distillation: train the student network to minimize the KL between the teacher's soft predictions and its own outputs. (4) RLHF (ChatGPT-style training): a KL penalty keeps the fine-tuned model from drifting too far from the base model during reward optimization. The asymmetry of KL matters: KL(p‖q) heavily penalizes q assigning near-zero probability where p has mass, so minimizing it over q is mode-covering; KL(q‖p) heavily penalizes q placing mass where p is near zero, so minimizing it over q is mode-seeking.
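For the VAE case, the KL term has a closed form when the encoder outputs a diagonal Gaussian and the prior is a standard normal. Both of those are assumptions here, since the text does not fix the distributions, and the function name and the example values of `mu` and `log_var` are hypothetical. The sketch checks the closed form against a Monte Carlo estimate.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, summed over dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var))

# Illustrative encoder outputs for a 2-D latent (made-up values).
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -1.0])

closed_form = gaussian_kl_to_standard_normal(mu, log_var)

# Monte Carlo check: average of log q(z) - log p(z) over samples z ~ q.
rng = np.random.default_rng(0)
std = np.exp(0.5 * log_var)
z = mu + std * rng.standard_normal((200_000, mu.size))
log_q = np.sum(-0.5 * (np.log(2 * np.pi) + log_var + (z - mu) ** 2 / np.exp(log_var)), axis=1)
log_p = np.sum(-0.5 * (np.log(2 * np.pi) + z**2), axis=1)
monte_carlo = float(np.mean(log_q - log_p))

print(f"closed form: {closed_form:.4f} nats")
print(f"Monte Carlo: {monte_carlo:.4f} nats")
```

The term is zero exactly when the encoder output matches the prior (`mu = 0`, `log_var = 0`) and grows as the posterior moves away from it.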
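Forward KL (mode-covering) vs reverse KL (mode-seeking) is a fundamental design choice in generative models. Maximum-likelihood training, which the VAE objective lower-bounds, corresponds to minimizing the forward KL from the data distribution to the model, while the VAE's approximate posterior q(z|x) is itself fit with a reverse KL, KL(q(z|x) ‖ p(z|x)). The original GAN objective corresponds to the Jensen-Shannon divergence rather than reverse KL, although GANs in practice often show mode-seeking behavior.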
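The difference between mode-covering and mode-seeking shows up clearly when a single Gaussian is fit to a bimodal target. The following is a toy sketch under stated assumptions: the target is an equal mixture of two Gaussians at -3 and +3, everything is discretized on a grid, and the fit is a brute-force grid search rather than gradient-based training.

```python
import numpy as np
from scipy.special import logsumexp

x = np.linspace(-10.0, 10.0, 2001)   # evaluation grid

def log_gauss_pmf(m, s):
    """Log-probabilities of N(m, s^2), discretized and renormalized on the grid."""
    log_unnorm = -0.5 * ((x - m) / s) ** 2
    return log_unnorm - logsumexp(log_unnorm)

# Target p: equal mixture of two well-separated Gaussians (illustrative choice).
log_p = np.logaddexp(log_gauss_pmf(-3.0, 0.7), log_gauss_pmf(3.0, 0.7)) - np.log(2.0)

def kl(log_a, log_b):
    """KL(a || b) in nats for two log-pmfs on the same grid."""
    return float(np.sum(np.exp(log_a) * (log_a - log_b)))

best_fwd = None   # (value, mean, std) minimizing forward KL(p || q)
best_rev = None   # (value, mean, std) minimizing reverse KL(q || p)
for m in np.linspace(-5.0, 5.0, 101):
    for s in np.linspace(0.3, 5.0, 48):
        log_q = log_gauss_pmf(m, s)
        fwd, rev = kl(log_p, log_q), kl(log_q, log_p)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, m, s)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, m, s)

print(f"forward KL fit: mean={best_fwd[1]:+.2f}, std={best_fwd[2]:.2f}")
print(f"reverse KL fit: mean={best_rev[1]:+.2f}, std={best_rev[2]:.2f}")
```

The forward-KL fit should sit between the two modes with a wide spread so that it covers both, while the reverse-KL fit should lock onto one mode with a narrow spread. Which of the two symmetric modes the reverse fit picks depends only on tie-breaking in the search.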