ML Learning Hub
Foundations · Intermediate

Information Theory

Entropy, cross-entropy, KL divergence — the math behind why loss functions work

Entropy, cross-entropy loss, KL divergence, and mutual information: the mathematical backbone behind why cross-entropy works as a loss function, how the VAE objective is regularized, and how decision trees choose splits.

Prerequisites

Probability & Statistics

Concepts Covered

Entropy, Cross-Entropy, KL Divergence, Mutual Information, Information Gain, Log Loss, Bits

Key Formulas

Entropy

H(p) = -Σ p(x) log2 p(x). Average 'surprise' in bits: maximum when all outcomes are equally likely, zero when the outcome is deterministic.

Cross-Entropy Loss

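H(p, q) = -Σ p(x) log2 q(x). Expected bits needed to encode samples from p using a code designed for q. This is the classification loss (log loss); ML libraries usually compute it with natural logs, giving nats instead of bits.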

KL Divergence

KL(p‖q) = Σ p(x) log2 (p(x) / q(x)). Extra bits needed to encode p with a code optimized for q. Always ≥ 0, and equal to 0 iff p = q.

Mutual Information

I(X;Y) = H(X) - H(X|Y) = Σ p(x,y) log2 (p(x,y) / (p(x) p(y))). How much knowing Y reduces uncertainty about X; used in feature selection and representation learning.
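
As a rough illustration of the mutual information entry above, the sketch below computes I(X;Y) in bits from a small joint probability table. The helper name `mutual_information` and the three joint tables are made up for this example; they are not part of the lesson's own code.

```python
import numpy as np

def mutual_information(joint: np.ndarray) -> float:
    """I(X;Y) in bits from a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()               # normalize, in case counts were passed
    px = joint.sum(axis=1, keepdims=True)     # marginal p(x), shape (n_x, 1)
    py = joint.sum(axis=0, keepdims=True)     # marginal p(y), shape (1, n_y)
    mask = joint > 0                          # skip empty cells: 0 * log(0) = 0
    ratio = joint[mask] / (px @ py)[mask]     # p(x,y) / (p(x) p(y))
    return float(np.sum(joint[mask] * np.log2(ratio)))

# Hypothetical joint tables over two binary variables
independent = np.array([[0.25, 0.25], [0.25, 0.25]])  # Y tells nothing about X
copy = np.array([[0.5, 0.0], [0.0, 0.5]])              # Y is an exact copy of X
noisy = np.array([[0.4, 0.1], [0.1, 0.4]])             # Y agrees with X 80% of the time

mi_independent = mutual_information(independent)
mi_copy = mutual_information(copy)
mi_noisy = mutual_information(noisy)

print(f"independent: {mi_independent:.4f} bits")
print(f"exact copy:  {mi_copy:.4f} bits")
print(f"noisy copy:  {mi_noisy:.4f} bits")
```

Independent variables share no information, an exact copy of a fair bit shares the full 1 bit, and the noisy copy lands in between.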

🎯

Why Information Theory Underpins ML Loss Functions

motivation

When you train a classifier with cross-entropy loss, you're minimizing the number of 'bits' needed to communicate ground-truth labels using the model's predicted distribution. When a VAE maximizes the ELBO (equivalently, minimizes the negative ELBO), the regularization term is a KL divergence between the learned latent distribution and a prior. When you measure a decision tree split with information gain, you're computing the reduction in entropy. The connection to information theory is not an accident: it provides a principled, unified framework for understanding why these seemingly ad hoc loss functions are well-founded choices for their respective goals.

Cross-entropy decomposes as H(p,q) = H(p) + KL(p‖q). Since H(p) depends only on the data and not on the model, minimizing cross-entropy over q is exactly minimizing KL(p‖q), the divergence of the true distribution p from the model q.
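
The decomposition can be checked numerically. The sketch below uses two made-up model distributions, `q` and `q2`, against a made-up soft-label target `p`; the specific numbers are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical target distribution p and two candidate model distributions
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
q2 = np.array([0.65, 0.22, 0.13])   # a model that sits closer to p

def cross_entropy_bits(p, q):
    return float(-np.sum(p * np.log2(q)))

def kl_bits(p, q):
    return float(np.sum(p * np.log2(p / q)))

entropy_p = float(-np.sum(p * np.log2(p)))    # H(p): does not depend on the model

ce_q, kl_q = cross_entropy_bits(p, q), kl_bits(p, q)
ce_q2, kl_q2 = cross_entropy_bits(p, q2), kl_bits(p, q2)

print(f"H(p)            = {entropy_p:.4f} bits")
print(f"H(p,q)          = {ce_q:.4f} bits")
print(f"H(p) + KL(p||q) = {entropy_p + kl_q:.4f} bits")

# Moving from q to q2 lowers cross-entropy and KL by the same amount,
# because the H(p) term is identical in both.
print(f"cross-entropy drop: {ce_q - ce_q2:.4f} bits")
print(f"KL drop:            {kl_q - kl_q2:.4f} bits")
```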

💡

Entropy: Measuring Surprise

intuition

Think of entropy as the average surprise in a probability distribution. A fair coin (50/50) has entropy H = 1 bit: you gain exactly 1 bit of information on each flip. A biased coin (99/1) has entropy of about 0.08 bits, because you're rarely surprised. A uniform distribution over 256 outcomes has entropy H = 8 bits, so you need 8 bits to describe each outcome. ML application: a well-calibrated model's predictions near a class boundary have high entropy (uncertain), and its predictions on clear examples have near-zero entropy (confident). Entropy-regularized RL (for example Soft Actor-Critic) maximizes expected reward plus an entropy bonus to encourage exploration.

Maximum entropy principle: given constraints, choose the distribution that maximizes entropy. With a fixed mean and variance on the real line this yields the Normal distribution, the choice that assumes the least beyond those constraints.
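
A small spot check of the maximum entropy claim: the sketch below compares the differential entropy of three distributions that share the same mean and variance, using `scipy.stats`. The choice of Laplace and uniform as comparison distributions and `sigma = 1.0` are assumptions for this example; comparing three candidates illustrates the principle but does not prove it.

```python
import numpy as np
from scipy import stats

sigma = 1.0  # shared standard deviation (assumed for the comparison)

candidates = {
    "normal": stats.norm(loc=0.0, scale=sigma),
    # Laplace variance is 2 * b^2, so b = sigma / sqrt(2)
    "laplace": stats.laplace(loc=0.0, scale=sigma / np.sqrt(2)),
    # Uniform on [loc, loc + w] has variance w^2 / 12, so w = sigma * sqrt(12)
    "uniform": stats.uniform(loc=-sigma * np.sqrt(3), scale=sigma * np.sqrt(12)),
}

# Differential entropy in nats for each frozen distribution
entropies = {name: float(dist.entropy()) for name, dist in candidates.items()}

for name, dist in candidates.items():
    print(f"{name:8s} var={dist.var():.3f}  entropy={entropies[name]:.4f} nats")
```

All three have mean 0 and variance 1, and the Normal comes out with the largest entropy of the three.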

</>

Entropy, Cross-Entropy & KL Divergence in Practice

code
python
import numpy as np
from scipy.special import xlogy    # xlogy(x, y) = x * log(y), with 0 * log(0) = 0
from scipy.stats import entropy as scipy_entropy

def entropy(p: np.ndarray, base: float = 2) -> float:
    """Shannon entropy H(p) in bits (base=2) or nats (base=e)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                  # 0 * log(0) = 0 by convention
    # Written as sum p * log(1/p) so a deterministic p gives 0.0 rather than -0.0
    return float(np.sum(p * np.log(1.0 / p)) / np.log(base))

def cross_entropy(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """H(p, q) = -sum p * log(q), in nats (natural log)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # xlogy contributes 0 wherever p == 0; eps keeps the log finite where q == 0
    return float(-np.sum(xlogy(p, q + eps)))

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p||q) in nats. NOT symmetric."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                  # terms with p == 0 contribute nothing
    return float(np.sum(p[mask] * np.log((p[mask] + eps) / (q[mask] + eps))))

# ── 1. Entropy of various distributions ──────────────────────────────────────
print("Entropy examples (bits):")
print(f"  Fair coin [0.5, 0.5]:        {entropy([0.5, 0.5]):.4f}")  # 1.0 bit
print(f"  Biased coin [0.99, 0.01]:    {entropy([0.99, 0.01]):.4f}")  # ≈ 0.08 bits
print(f"  Uniform 8 classes:           {entropy([1/8]*8):.4f}")  # 3.0 bits
print(f"  Certain [1.0, 0.0]:          {entropy([1.0, 0.0]):.4f}")  # 0.0 bits

# ── 2. Cross-entropy loss (classification) ────────────────────────────────────
# Ground truth (one-hot): cat
p_true = np.array([1., 0., 0.])       # cat
# Model predictions:
q_good = np.array([0.8, 0.1, 0.1])   # confident & correct
q_bad  = np.array([0.1, 0.8, 0.1])   # confident & wrong
q_uncertain = np.array([0.4, 0.3, 0.3])  # uncertain, leans correct

# cross_entropy and kl_divergence use the natural log, so these values are in nats
print("\nCross-entropy losses (nats):")
print(f"  Good prediction:    {cross_entropy(p_true, q_good):.4f}")   # low: -ln(0.8)
print(f"  Bad prediction:     {cross_entropy(p_true, q_bad):.4f}")    # high: -ln(0.1)
print(f"  Uncertain but ok:   {cross_entropy(p_true, q_uncertain):.4f}")  # -ln(0.4)

# H(p,q) = H(p) + KL(p||q). Since H(p)=0 for one-hot: CE = KL(p||q)
print(f"  KL(p_true||q_good) = {kl_divergence(p_true, q_good):.4f}")
# ── 3. KL divergence: asymmetry ───────────────────────────────────────────────
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.3, 0.5, 0.2])
print(f"\nKL(p||q) = {kl_divergence(p,q):.4f}")
print(f"KL(q||p) = {kl_divergence(q,p):.4f}")  # different: KL is not a distance metric

# ── 4. Information gain in decision trees ─────────────────────────────────────
def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children (bits)."""
    n = len(parent)
    n_l, n_r = len(left), len(right)
    # np.bincount turns integer class labels into per-class counts
    h_p = scipy_entropy(np.bincount(parent) / n, base=2)
    h_l = scipy_entropy(np.bincount(left)   / n_l, base=2) if n_l > 0 else 0
    h_r = scipy_entropy(np.bincount(right)  / n_r, base=2) if n_r > 0 else 0
    return h_p - (n_l/n * h_l + n_r/n * h_r)

# 10 samples: 6 class-0, 4 class-1. Split: left=[0,0,0,0,1], right=[0,0,1,1,1]
parent = np.array([0,0,0,0,0,0,1,1,1,1])
left   = np.array([0,0,0,0,1])
right  = np.array([0,0,1,1,1])
print(f"\nInformation gain: {information_gain(parent, left, right):.4f} bits")

🔭

KL Divergence in Modern ML

insight

KL divergence appears throughout modern ML: (1) VAE loss = reconstruction loss + KL(q(z|x) ‖ p(z)), where the KL term regularizes the latent space toward the prior. (2) Policy-gradient RL: TRPO constrains the KL between the old and new policy to avoid catastrophic updates, and PPO approximates that constraint with a clipped objective or an explicit KL penalty. (3) Knowledge distillation: the student network is trained to minimize the KL between its outputs and the teacher's soft predictions. (4) RLHF (ChatGPT-style training): a KL penalty keeps the fine-tuned model from drifting too far from the base model during reward optimization. The asymmetry of KL matters. With p as the target and q as the model, forward KL(p‖q) heavily penalizes q assigning near-zero probability where p has mass (mode-covering), while reverse KL(q‖p) heavily penalizes q placing mass where p is near zero (mode-seeking).
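
For the VAE case, the KL term has a closed form under a common set of assumptions: the encoder outputs a diagonal Gaussian q(z|x) = N(mu, diag(sigma²)) and the prior is a standard normal p(z) = N(0, I). The sketch below evaluates that closed form in NumPy. The function name and the example inputs are hypothetical, and a real VAE would compute this inside its training framework.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu: np.ndarray, log_var: np.ndarray) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, summed over latent dims."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    # Per-dimension term: 0.5 * (mu^2 + sigma^2 - 1 - log sigma^2)
    return float(0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var))

# Hypothetical encoder outputs for a 4-dimensional latent space
kl_at_prior = gaussian_kl_to_standard_normal(np.zeros(4), np.zeros(4))
kl_shifted = gaussian_kl_to_standard_normal(np.array([1.0, -1.0, 0.0, 0.0]), np.zeros(4))
kl_narrow = gaussian_kl_to_standard_normal(np.zeros(4), np.full(4, np.log(0.01)))

print(f"posterior equals prior:   {kl_at_prior:.4f} nats")
print(f"mean shifted in two dims: {kl_shifted:.4f} nats")
print(f"variance shrunk to 0.01:  {kl_narrow:.4f} nats")
```

The penalty is zero only when the posterior matches the prior, and it grows both when the mean drifts away from zero and when the variance collapses.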

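Forward KL (mode-covering) vs reverse KL (mode-seeking) is a fundamental design choice in generative models. Maximum-likelihood training, which VAEs approximate through the ELBO, corresponds to forward KL from the data distribution to the model and tends to cover every mode. GAN samples tend to look more mode-seeking, although the original GAN objective corresponds to Jensen-Shannon divergence rather than to reverse KL exactly.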
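
The mode-covering versus mode-seeking behavior can be seen on a toy problem. The sketch below discretizes a two-mode target p on a grid and scores two hand-picked single-Gaussian candidates: a wide one that spans both modes and a narrow one that sits on a single mode. The grid, the mixture parameters, and both candidates are assumptions chosen for illustration; nothing is optimized here.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def discretize(density):
    """Turn density values on the grid into a probability vector."""
    return density / density.sum()

def kl(a, b):
    """KL(a || b) in nats for strictly positive probability vectors."""
    return float(np.sum(a * np.log(a / b)))

x = np.linspace(-8.0, 8.0, 2001)

# Target: equal mixture of two well-separated Gaussians
p = discretize(0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5))

# Candidate models: one wide Gaussian matching p's mean and variance,
# and one narrow Gaussian sitting on the right-hand mode only
q_wide = discretize(normal_pdf(x, 0.0, np.sqrt(4.25)))
q_narrow = discretize(normal_pdf(x, 2.0, 0.5))

forward = {"wide": kl(p, q_wide), "narrow": kl(p, q_narrow)}   # KL(p || q)
reverse = {"wide": kl(q_wide, p), "narrow": kl(q_narrow, p)}   # KL(q || p)

for name in ("wide", "narrow"):
    print(f"{name:6s} forward KL={forward[name]:.3f}  reverse KL={reverse[name]:.3f}")
```

Forward KL prefers the wide candidate because the narrow one leaves a whole mode of p almost uncovered. Reverse KL prefers the narrow candidate because the wide one puts mass in the low-density gap between the modes.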
