Chapter 04

Information Theory

5 formulas · Intermediate · Prerequisite: Chapter 01

What you'll learn

  • How Shannon entropy quantifies uncertainty and information content
  • Why cross-entropy is the loss function behind every transformer language model
  • How KL divergence measures the "cost" of using the wrong distribution
  • Jensen-Shannon divergence as a symmetric, bounded alternative to KL
  • Why perplexity is the standard evaluation metric for language models

Introduction

In 1948, Claude Shannon published "A Mathematical Theory of Communication" and invented the field of information theory. His central insight was deceptively simple: information is surprise. The less likely an event, the more information it carries when it occurs.

These ideas are not just theoretical curiosities. If you have ever trained a neural network, you have minimized cross-entropy. If you have ever evaluated a language model, you have reported perplexity. And if you have ever compared two probability distributions, you have (perhaps unknowingly) used KL divergence. This chapter builds the complete chain of formulas, from entropy to perplexity, showing how each one connects to the next.

The key relationships to keep in mind throughout this chapter:

The information theory chain: Entropy ≤ Cross-Entropy  •  KL = Cross-Entropy − Entropy  •  JS symmetrizes KL  •  Perplexity = 2^Cross-Entropy

Shannon Entropy

Think of it this way: Entropy measures the average "surprise" in a probability distribution. If you know with certainty what word comes next (probability 1.0), there is zero surprise and entropy is 0. If every word in a 10,000-word vocabulary is equally likely, you are maximally uncertain and entropy is at its maximum. Entropy answers the question: how many bits do I need, on average, to encode a sample from this distribution?

Shannon Entropy
$$H(\textcolor{#e11d48}{X}) = -\sum_{x} \textcolor{#2563eb}{p(x)} \cdot \log_2 \textcolor{#2563eb}{p(x)}$$
X A random variable (e.g., "the next word")
p(x) Probability of outcome x (e.g., probability of a specific word)
log₂ Logarithm base 2 — entropy is measured in bits

Worked Example

A tiny language has 4 words with probabilities: "the" = 0.5, "cat" = 0.25, "sat" = 0.125, "quietly" = 0.125.

  1. Compute each term: \(-p(x) \log_2 p(x)\)
    "the": \(-0.5 \times \log_2(0.5) = -0.5 \times (-1) = 0.5\)
    "cat": \(-0.25 \times \log_2(0.25) = -0.25 \times (-2) = 0.5\)
    "sat": \(-0.125 \times \log_2(0.125) = -0.125 \times (-3) = 0.375\)
    "quietly": \(-0.125 \times \log_2(0.125) = 0.375\)
  2. Sum all terms:
    \(H(X) = 0.5 + 0.5 + 0.375 + 0.375\) = 1.75 bits

Maximum entropy for 4 equally likely words would be \(\log_2(4) = 2.0\) bits. Our distribution's 1.75 bits reflects the fact that "the" is more predictable, reducing average surprise.

Bounds: For a vocabulary of size V, entropy ranges from 0 (one word has probability 1) to log₂(V) (uniform distribution). Real language has entropy far below the maximum because words are highly predictable from context.
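The computation above takes only a few lines of code. Here is a minimal sketch using just Python's standard library; the function name `entropy` is our own choice, not from any particular package.

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits: H(X) = -sum of p(x) * log2 p(x).
    Zero-probability outcomes contribute nothing (p * log p -> 0 as p -> 0)."""
    return -sum(px * log2(px) for px in p if px > 0)

# The four-word toy language from the worked example.
p = [0.5, 0.25, 0.125, 0.125]
print(entropy(p))                         # 1.75 bits
print(log2(4))                            # 2.0 -- the maximum for V = 4
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 -- the uniform distribution hits it
```

Skipping zero-probability terms matters in practice: a one-hot distribution would otherwise raise a math domain error on `log2(0)`.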

Cross-Entropy

Think of it this way: Cross-entropy measures the average number of bits needed to encode data from distribution P when you are using a code optimized for a different distribution Q. If Q matches P perfectly, cross-entropy equals entropy (the theoretical minimum). If Q is a bad approximation of P, cross-entropy is higher — you are wasting bits. This is exactly the training loss for transformer language models.

Cross-Entropy
$$H(\textcolor{#e11d48}{P}, \textcolor{#2563eb}{Q}) = -\sum_{x} \textcolor{#e11d48}{p(x)} \cdot \log_2 \textcolor{#2563eb}{q(x)}$$
P The true distribution (e.g., one-hot vector of the actual next word)
Q The model's predicted distribution (e.g., softmax output of a transformer)
p(x) True probability of word x
q(x) Model's predicted probability of word x

Worked Example

True distribution P: "the" = 0.5, "cat" = 0.25, "sat" = 0.125, "quietly" = 0.125.
Model distribution Q: "the" = 0.4, "cat" = 0.3, "sat" = 0.2, "quietly" = 0.1.

  1. Compute each term: \(-p(x) \log_2 q(x)\)
    "the": \(-0.5 \times \log_2(0.4) = -0.5 \times (-1.3219) = 0.6610\)
    "cat": \(-0.25 \times \log_2(0.3) = -0.25 \times (-1.7370) = 0.4342\)
    "sat": \(-0.125 \times \log_2(0.2) = -0.125 \times (-2.3219) = 0.2902\)
    "quietly": \(-0.125 \times \log_2(0.1) = -0.125 \times (-3.3219) = 0.4152\)
  2. Sum all terms:
    \(H(P, Q) = 0.6610 + 0.4342 + 0.2902 + 0.4152 \approx 1.801\) bits

Recall that the true entropy H(P) was 1.75 bits. The cross-entropy of 1.801 bits is higher — the extra 0.051 bits is the "wasted" encoding cost. That gap is the KL divergence.

Key property: Cross-entropy is always ≥ entropy: \(H(P,Q) \geq H(P)\). Equality holds only when Q = P. This is why minimizing cross-entropy during training pushes the model toward the true distribution.
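The same calculation in code — a minimal sketch assuming P and Q are given as aligned probability lists over the same vocabulary (the function name `cross_entropy` is ours):

```python
from math import log2

def cross_entropy(p, q):
    """H(P, Q) = -sum of p(x) * log2 q(x), in bits.
    Skips outcomes with p(x) == 0; diverges if q(x) == 0 where p(x) > 0."""
    return -sum(px * log2(qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.125, 0.125]   # true distribution
q = [0.4, 0.3, 0.2, 0.1]        # model's predicted distribution
print(round(cross_entropy(p, q), 3))  # 1.801 bits -- above H(P)
print(round(cross_entropy(p, p), 3))  # 1.75 bits -- equals H(P) when Q = P
```

The second line illustrates the key property: feeding the true distribution in for Q recovers the entropy exactly.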

KL Divergence

Think of it this way: KL divergence measures the extra bits wasted when you encode data from distribution P using distribution Q's code instead of P's own optimal code. It is the gap between cross-entropy and entropy: KL(P||Q) = H(P,Q) − H(P). Crucially, it is not symmetric: KL(P||Q) ≠ KL(Q||P). This is why it is called a "divergence" rather than a "distance."

Kullback-Leibler Divergence
$$D_{\text{KL}}(\textcolor{#e11d48}{P} \| \textcolor{#2563eb}{Q}) = \sum_{x} \textcolor{#e11d48}{p(x)} \cdot \log_2 \frac{\textcolor{#e11d48}{p(x)}}{\textcolor{#2563eb}{q(x)}}$$
P The reference (true) distribution
Q The approximating (model) distribution
p(x)/q(x) The likelihood ratio — how much P and Q disagree on outcome x

Worked Example

Using the same P and Q from the cross-entropy example above:

  1. We already know H(P) = 1.75 bits and H(P,Q) = 1.801 bits.
    Direct computation: KL(P||Q) = H(P,Q) − H(P) = 1.801 − 1.750 = 0.051 bits
  2. Verify term by term: \(\sum p(x) \log_2 \frac{p(x)}{q(x)}\)
    "the": \(0.5 \times \log_2(0.5/0.4) = 0.5 \times 0.3219 = 0.1610\)
    "cat": \(0.25 \times \log_2(0.25/0.3) = 0.25 \times (-0.2630) = -0.0658\)
    "sat": \(0.125 \times \log_2(0.125/0.2) = 0.125 \times (-0.6781) = -0.0848\)
    "quietly": \(0.125 \times \log_2(0.125/0.1) = 0.125 \times 0.3219 = 0.0402\)
    Total: \(0.1610 - 0.0658 - 0.0848 + 0.0402 \approx 0.051\) bits

The KL divergence of 0.051 bits tells us the model Q is quite close to the true distribution P. A KL divergence of 0 would mean a perfect model.

Asymmetry warning: KL(P||Q) penalizes Q for assigning low probability where P has high probability. KL(Q||P) does the opposite. In practice, KL(P||Q) is called the "forward KL" (used in maximum likelihood / cross-entropy training), while KL(Q||P) is the "reverse KL" (used in variational inference). They produce very different model behaviors.
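The asymmetry is easy to see numerically. Below is a minimal sketch (the helper name `kl_divergence` is ours) using the same P and Q as the worked example:

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(P || Q) = sum of p(x) * log2(p(x)/q(x)), in bits.
    Infinite whenever q(x) == 0 for an outcome x with p(x) > 0."""
    return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.125, 0.125]
q = [0.4, 0.3, 0.2, 0.1]
print(round(kl_divergence(p, q), 3))  # 0.051 -- the "forward" KL
print(round(kl_divergence(q, p), 3))  # 0.054 -- the "reverse" KL: asymmetric!
```

Forward and reverse KL already differ on this nearly identical pair; for distributions that disagree strongly, the gap between the two directions can be dramatic.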

Jensen-Shannon Divergence

Think of it this way: KL divergence has two problems — it is asymmetric and it blows up to infinity when Q assigns zero probability to an event that P considers possible. Jensen-Shannon (JS) divergence fixes both: it averages P and Q into a mixture M, then measures how far each is from that mixture. The result is symmetric, always finite, and bounded between 0 and 1 (when using log base 2).

Jensen-Shannon Divergence
$$\text{JS}(\textcolor{#e11d48}{P} \| \textcolor{#2563eb}{Q}) = \frac{1}{2}\, D_{\text{KL}}(\textcolor{#e11d48}{P} \| \textcolor{#059669}{M}) + \frac{1}{2}\, D_{\text{KL}}(\textcolor{#2563eb}{Q} \| \textcolor{#059669}{M})$$ $$\text{where } \textcolor{#059669}{M} = \frac{1}{2}(\textcolor{#e11d48}{P} + \textcolor{#2563eb}{Q})$$
P, Q The two distributions to compare
M The mixture (average) of P and Q — ensures neither KL term explodes

Worked Example

P = [0.5, 0.25, 0.125, 0.125] and Q = [0.4, 0.3, 0.2, 0.1] (same distributions as before).

  1. Compute the mixture: M = (P + Q) / 2
    M = [0.45, 0.275, 0.1625, 0.1125]
  2. Compute KL(P||M) and KL(Q||M):
    KL(P||M) = 0.5 log₂(0.5/0.45) + 0.25 log₂(0.25/0.275) + 0.125 log₂(0.125/0.1625) + 0.125 log₂(0.125/0.1125) ≈ 0.0133
    KL(Q||M) = 0.4 log₂(0.4/0.45) + 0.3 log₂(0.3/0.275) + 0.2 log₂(0.2/0.1625) + 0.1 log₂(0.1/0.1125) ≈ 0.0126
  3. JS(P||Q) = 0.5 × 0.0133 + 0.5 × 0.0126 ≈ 0.013 bits

The JS divergence of 0.013 confirms these distributions are very similar. Its square root (≈ 0.114) is a proper distance metric.

vs. KL: JS is symmetric (JS(P||Q) = JS(Q||P)), bounded [0, 1], and never infinite. Its square root satisfies the triangle inequality, making it a true metric. This makes JS divergence ideal for comparing text corpora, topic distributions, or generative model outputs.
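A minimal code sketch of the same computation (helper names are ours); note how `js_divergence` simply reuses the KL helper with the mixture M:

```python
from math import log2, sqrt

def kl(p, q):
    """KL divergence in bits; inputs are aligned probability lists."""
    return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

def js_divergence(p, q):
    """JS(P || Q) = 0.5 * KL(P||M) + 0.5 * KL(Q||M), where M = (P + Q) / 2."""
    m = [(px + qx) / 2 for px, qx in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.25, 0.125, 0.125]
q = [0.4, 0.3, 0.2, 0.1]
print(round(js_divergence(p, q), 3))               # 0.013 bits
print(js_divergence(p, q) == js_divergence(q, p))  # True -- symmetric
print(round(sqrt(js_divergence(p, q)), 3))         # 0.114 -- a true distance
```

Because every outcome gets at least half its probability from whichever distribution mentions it, M is never zero where P or Q is positive, so neither KL term can blow up.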

Perplexity

Think of it this way: Perplexity is the exponential of cross-entropy. It tells you the effective vocabulary size the model is choosing from at each step. A perplexity of 100 means the model is, on average, as uncertain as if it were choosing uniformly among 100 words. Lower perplexity = better model. This is THE standard evaluation metric for language models.

Perplexity
$$\text{PP}(\textcolor{#e11d48}{P}, \textcolor{#2563eb}{Q}) = 2^{H(\textcolor{#e11d48}{P}, \textcolor{#2563eb}{Q})} = 2^{-\frac{1}{\textcolor{#d97706}{N}} \sum_{i=1}^{N} \log_2 \textcolor{#2563eb}{q}(w_i)}$$
P The true data distribution (the test set)
Q The model's predicted distribution
N Number of tokens in the test set
q(wᵢ) The model's predicted probability of the i-th word

Worked Example

A language model assigns the following probabilities to a 4-word test sequence "the cat sat quietly":

  1. Model probabilities: q("the") = 0.4, q("cat") = 0.3, q("sat") = 0.2, q("quietly") = 0.1
  2. Compute average log probability:
    \(-\frac{1}{4}[\log_2(0.4) + \log_2(0.3) + \log_2(0.2) + \log_2(0.1)]\)
    \(= -\frac{1}{4}[-1.322 - 1.737 - 2.322 - 3.322]\)
    \(= -\frac{1}{4}(-8.703) = 2.176\) bits
  3. Exponentiate:
    \(\text{PP} = 2^{2.176}\) = 4.52

A perplexity of 4.52 means the model is roughly as uncertain as choosing uniformly from 4–5 words at each step. For a 4-word vocabulary, this is close to random. State-of-the-art large language models achieve perplexities in the range of 5–20 on natural language, despite vocabularies of 50,000+ tokens.

Relationship to cross-entropy: Perplexity = 2^cross-entropy. If a model has cross-entropy of 3.0 bits, its perplexity is 2³ = 8. Because of the exponential, a 10% reduction in cross-entropy yields a larger proportional drop in perplexity: going from 3.0 to 2.7 bits takes perplexity from 8 down to 2^2.7 ≈ 6.5, a reduction of about 19%.
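The worked example in code — a minimal sketch that takes the model's per-token probabilities for a test sequence (the function name `perplexity` is ours):

```python
from math import log2

def perplexity(token_probs):
    """PP = 2 ** (average negative log2 probability per token).
    Equivalent to 2 ** H(P, Q) when P is the empirical test distribution."""
    n = len(token_probs)
    avg_bits = -sum(log2(q) for q in token_probs) / n
    return 2 ** avg_bits

# Model probabilities for the test sequence "the cat sat quietly".
probs = [0.4, 0.3, 0.2, 0.1]
print(round(perplexity(probs), 2))           # 4.52 -- "effective" choices per step
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0 -- uniform over 4 words
```

Averaging in log space before exponentiating is also numerically safer than multiplying raw probabilities, which underflows to zero on sequences of realistic length.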

Interactive Demos

Demo 1: Entropy Calculator

This interactive demo takes a text input and shows its word-level probability distribution and Shannon entropy.

Demo 2: KL Divergence Explorer

This interactive demo takes two texts and compares their word distributions. Notice how KL(P||Q) ≠ KL(Q||P) — divergence is asymmetric.

Summary: The Information Theory Chain

  • Entropy H(P) — measures uncertainty in P; range [0, log₂V]; key property: the minimum possible encoding cost. NLP application: vocabulary richness, text complexity.
  • Cross-Entropy H(P,Q) — measures the cost of encoding P with Q; range [H(P), +∞); key property: always ≥ entropy. NLP application: transformer training loss.
  • KL Divergence — measures the extra bits wasted by Q; range [0, +∞); key property: asymmetric, equals H(P,Q) − H(P). NLP application: model comparison, variational inference.
  • JS Divergence — a symmetric divergence; range [0, 1]; key property: symmetric, bounded, and its square root is a metric. NLP application: corpus comparison, GANs.
  • Perplexity — measures effective vocabulary size; range [1, +∞); key property: equals 2^cross-entropy. NLP application: THE language model evaluation metric.

Key Takeaway

All five formulas are deeply connected. Entropy is the baseline. Cross-entropy adds the cost of using a wrong model. KL divergence isolates that extra cost. JS divergence makes it symmetric and bounded. Perplexity converts cross-entropy to an intuitive "effective choices" scale. Master these relationships and you understand the mathematical heart of language modeling.
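The whole chain can be verified numerically in a few lines. A minimal sketch with the chapter's toy distributions (helper names are ours):

```python
from math import log2

def entropy(p):
    return -sum(x * log2(x) for x in p if x > 0)

def cross_entropy(p, q):
    return -sum(x * log2(y) for x, y in zip(p, q) if x > 0)

def kl(p, q):
    return sum(x * log2(x / y) for x, y in zip(p, q) if x > 0)

p = [0.5, 0.25, 0.125, 0.125]   # true distribution
q = [0.4, 0.3, 0.2, 0.1]        # model distribution

# KL divergence is exactly the gap between cross-entropy and entropy.
assert abs(kl(p, q) - (cross_entropy(p, q) - entropy(p))) < 1e-12
# Cross-entropy never dips below entropy (Gibbs' inequality).
assert cross_entropy(p, q) >= entropy(p)
# Perplexity exponentiates cross-entropy: 2 ** H(P, Q).
print(round(2 ** cross_entropy(p, q), 3))  # perplexity of Q measured against P
```

If both assertions pass silently, the chain identities hold to floating-point precision on this example.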