Information Theory
What you'll learn
- How Shannon entropy quantifies uncertainty and information content
- Why cross-entropy is the loss function behind every transformer language model
- How KL divergence measures the "cost" of using the wrong distribution
- Jensen-Shannon divergence as a symmetric, bounded alternative to KL
- Why perplexity is the standard evaluation metric for language models
Introduction
In 1948, Claude Shannon published "A Mathematical Theory of Communication" and invented the field of information theory. His central insight was deceptively simple: information is surprise. The less likely an event, the more information it carries when it occurs.
These ideas are not just theoretical curiosities. If you have ever trained a neural network, you have minimized cross-entropy. If you have ever evaluated a language model, you have reported perplexity. And if you have ever compared two probability distributions, you have (perhaps unknowingly) used KL divergence. This chapter builds the complete chain of formulas, from entropy to perplexity, showing how each one connects to the next.
The key relationships to keep in mind throughout this chapter: cross-entropy decomposes as \(H(P, Q) = H(P) + \mathrm{KL}(P \| Q)\), and perplexity is its exponential, \(\text{PP} = 2^{H(P, Q)}\).
Shannon Entropy
Think of it this way: Entropy measures the average "surprise" in a probability distribution. If you know with certainty what word comes next (probability 1.0), there is zero surprise and entropy is 0. If every word in a 10,000-word vocabulary is equally likely, you are maximally uncertain and entropy is at its maximum. Entropy answers the question: how many bits do I need, on average, to encode a sample from this distribution?
Worked Example
A tiny language has 4 words with probabilities: "the" = 0.5, "cat" = 0.25, "sat" = 0.125, "quietly" = 0.125.
1. Compute each term \(-p(x) \log_2 p(x)\):
   - "the": \(-0.5 \times \log_2(0.5) = -0.5 \times (-1) = 0.5\)
   - "cat": \(-0.25 \times \log_2(0.25) = -0.25 \times (-2) = 0.5\)
   - "sat": \(-0.125 \times \log_2(0.125) = -0.125 \times (-3) = 0.375\)
   - "quietly": \(-0.125 \times \log_2(0.125) = -0.125 \times (-3) = 0.375\)
2. Sum all terms:
   \(H(X) = 0.5 + 0.5 + 0.375 + 0.375 = 1.75\) bits
Maximum entropy for 4 equally likely words would be \(\log_2(4) = 2.0\) bits. Our distribution's 1.75 bits reflects the fact that "the" is more predictable, reducing average surprise.
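The calculation above is easy to reproduce in a few lines of Python. This is a minimal sketch, not a library function; the helper name `entropy` is our own.

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits: H(X) = -sum of p(x) * log2 p(x)."""
    # Skip zero-probability events: they contribute nothing (0 * log 0 -> 0).
    return -sum(px * log2(px) for px in p if px > 0)

# The tiny four-word language from the worked example:
# "the" = 0.5, "cat" = 0.25, "sat" = 0.125, "quietly" = 0.125
p = [0.5, 0.25, 0.125, 0.125]
print(entropy(p))   # 1.75
print(log2(4))      # 2.0 -- the maximum, reached only by the uniform distribution
```

Feeding in a uniform distribution `[0.25] * 4` returns the 2.0-bit maximum, confirming that skew toward "the" is what buys the 0.25-bit saving.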
Cross-Entropy
Think of it this way: Cross-entropy measures the average number of bits needed to encode data from distribution P when you are using a code optimized for a different distribution Q. If Q matches P perfectly, cross-entropy equals entropy (the theoretical minimum). If Q is a bad approximation of P, cross-entropy is higher — you are wasting bits. This is exactly the training loss for transformer language models.
Worked Example
True distribution P: "the" = 0.5, "cat" = 0.25, "sat" = 0.125, "quietly" = 0.125.
Model distribution Q: "the" = 0.4, "cat" = 0.3, "sat" = 0.2, "quietly" = 0.1.
1. Compute each term \(-p(x) \log_2 q(x)\):
   - "the": \(-0.5 \times \log_2(0.4) = -0.5 \times (-1.322) = 0.661\)
   - "cat": \(-0.25 \times \log_2(0.3) = -0.25 \times (-1.737) = 0.434\)
   - "sat": \(-0.125 \times \log_2(0.2) = -0.125 \times (-2.322) = 0.290\)
   - "quietly": \(-0.125 \times \log_2(0.1) = -0.125 \times (-3.322) = 0.415\)
2. Sum all terms:
   \(H(P, Q) = 0.661 + 0.434 + 0.290 + 0.415 \approx 1.801\) bits
Recall that the true entropy H(P) was 1.75 bits. The cross-entropy of 1.801 bits is higher — the extra 0.051 bits is the "wasted" encoding cost. That gap is the KL divergence.
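The same arithmetic in Python, as a quick sanity check (again a sketch; the helper name `cross_entropy` is ours):

```python
from math import log2

def cross_entropy(p, q):
    """H(P, Q) = -sum of p(x) * log2 q(x), in bits."""
    return -sum(px * log2(qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.125, 0.125]  # true distribution P
q = [0.4, 0.3, 0.2, 0.1]       # model distribution Q
print(round(cross_entropy(p, q), 3))  # 1.801
print(round(cross_entropy(p, p), 3))  # 1.75 -- equals H(P) when Q matches P
```

Note that `cross_entropy(p, p)` recovers the entropy of 1.75 bits: the theoretical minimum, achieved only by a perfect model.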
KL Divergence
Think of it this way: KL divergence measures the extra bits wasted when you encode data from distribution P using distribution Q's code instead of P's own optimal code. It is the gap between cross-entropy and entropy: KL(P||Q) = H(P,Q) − H(P). Crucially, it is not symmetric: KL(P||Q) ≠ KL(Q||P). This is why it is called a "divergence" rather than a "distance."
Worked Example
Using the same P and Q from the cross-entropy example above:
1. From the previous sections we already know H(P) = 1.75 bits and H(P, Q) = 1.801 bits, so the direct computation is KL(P||Q) = H(P, Q) − H(P) = 1.801 − 1.750 = 0.051 bits.
2. Verify term by term with \(\sum_x p(x) \log_2 \frac{p(x)}{q(x)}\):
   - "the": \(0.5 \times \log_2(0.5/0.4) = 0.5 \times 0.322 = 0.161\)
   - "cat": \(0.25 \times \log_2(0.25/0.3) = 0.25 \times (-0.263) = -0.066\)
   - "sat": \(0.125 \times \log_2(0.125/0.2) = 0.125 \times (-0.678) = -0.085\)
   - "quietly": \(0.125 \times \log_2(0.125/0.1) = 0.125 \times 0.322 = 0.040\)
   Total: \(0.161 - 0.066 - 0.085 + 0.040 \approx 0.051\) bits (the rounded terms sum to 0.050; full precision gives 0.051)
The KL divergence of 0.051 bits tells us the model Q is quite close to the true distribution P. A KL divergence of 0 would mean a perfect model.
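Both routes, the direct sum and the cross-entropy-minus-entropy identity, can be checked numerically. A minimal sketch using the distributions from the worked example:

```python
from math import log2

p = [0.5, 0.25, 0.125, 0.125]  # true distribution P
q = [0.4, 0.3, 0.2, 0.1]       # model distribution Q

# Direct definition: KL(P||Q) = sum of p(x) * log2(p(x) / q(x))
kl = sum(px * log2(px / qx) for px, qx in zip(p, q))

# The identity KL(P||Q) = H(P,Q) - H(P), computed from scratch
h_p = -sum(px * log2(px) for px in p)                       # entropy
h_pq = -sum(px * log2(qx) for px, qx in zip(p, q))          # cross-entropy

print(round(kl, 3))          # 0.051
print(round(h_pq - h_p, 3))  # 0.051 -- same gap, as the identity predicts
```

The two printed values agree, confirming that KL divergence is exactly the "wasted bits" gap between cross-entropy and entropy.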
Jensen-Shannon Divergence
Think of it this way: KL divergence has two problems — it is asymmetric and it blows up to infinity when Q assigns zero probability to an event that P considers possible. Jensen-Shannon (JS) divergence fixes both: it averages P and Q into a mixture M, then measures how far each is from that mixture. The result is symmetric, always finite, and bounded between 0 and 1 (when using log base 2).
Worked Example
P = [0.5, 0.25, 0.125, 0.125] and Q = [0.4, 0.3, 0.2, 0.1] (same distributions as before).
1. Compute the mixture \(M = (P + Q) / 2\):
   M = [0.45, 0.275, 0.1625, 0.1125]
2. Compute KL(P||M) and KL(Q||M):
   KL(P||M) = 0.5 log₂(0.5/0.45) + 0.25 log₂(0.25/0.275) + 0.125 log₂(0.125/0.1625) + 0.125 log₂(0.125/0.1125) ≈ 0.0133
   KL(Q||M) = 0.4 log₂(0.4/0.45) + 0.3 log₂(0.3/0.275) + 0.2 log₂(0.2/0.1625) + 0.1 log₂(0.1/0.1125) ≈ 0.0126
3. Average the two:
   JS(P||Q) = 0.5 × 0.0133 + 0.5 × 0.0126 ≈ 0.013 bits
The JS divergence of 0.013 confirms these distributions are very similar. Its square root (≈ 0.114) is a proper distance metric.
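The mixture construction and the symmetry claim are both easy to verify in code. A sketch (the helper names are ours; SciPy ships a similar utility, but a hand-rolled version keeps the formula visible):

```python
from math import log2

def kl(p, q):
    """KL(P||Q) in bits."""
    return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

def js_divergence(p, q):
    """JS(P||Q) = 0.5 * KL(P||M) + 0.5 * KL(Q||M), with M = (P + Q) / 2."""
    m = [(px + qx) / 2 for px, qx in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.25, 0.125, 0.125]
q = [0.4, 0.3, 0.2, 0.1]
print(round(js_divergence(p, q), 3))  # 0.013
print(round(js_divergence(q, p), 3))  # 0.013 -- symmetric, unlike KL
```

Because M always puts nonzero mass wherever P or Q does, neither KL term can blow up to infinity, which is what makes JS divergence finite even for non-overlapping supports.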
Perplexity
Think of it this way: Perplexity is the exponential of cross-entropy. It tells you the effective vocabulary size the model is choosing from at each step. A perplexity of 100 means the model is, on average, as uncertain as if it were choosing uniformly among 100 words. Lower perplexity = better model. This is THE standard evaluation metric for language models.
Worked Example
A language model assigns the following probabilities to a 4-word test sequence "the cat sat quietly":
1. Model probabilities: q("the") = 0.4, q("cat") = 0.3, q("sat") = 0.2, q("quietly") = 0.1
2. Compute the average negative log probability:
   \(-\frac{1}{4}\left[\log_2(0.4) + \log_2(0.3) + \log_2(0.2) + \log_2(0.1)\right]\)
   \(= -\frac{1}{4}[-1.322 - 1.737 - 2.322 - 3.322]\)
   \(= -\frac{1}{4}(-8.703) = 2.176\) bits
3. Exponentiate:
   \(\text{PP} = 2^{2.176} \approx 4.52\)
A perplexity of 4.52 means the model is roughly as uncertain as choosing uniformly from 4–5 words at each step. For a 4-word vocabulary, this is close to random. State-of-the-art large language models achieve perplexities in the range of 5–20 on natural language, despite vocabularies of 50,000+ tokens.
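The two-step recipe, average the negative log probabilities, then exponentiate, translates directly to Python. A minimal sketch (the function name `perplexity` is ours):

```python
from math import log2

def perplexity(probs):
    """PP = 2 to the power of the average negative log2 probability."""
    avg_neg_log_prob = -sum(log2(pr) for pr in probs) / len(probs)
    return 2 ** avg_neg_log_prob

# Model probabilities for the test sequence "the cat sat quietly"
probs = [0.4, 0.3, 0.2, 0.1]
print(round(perplexity(probs), 2))  # 4.52
```

As a sanity check, a model assigning the uniform 0.25 to every token of a 4-word vocabulary gets perplexity exactly 4.0: it is "choosing among 4 words" at every step.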
Summary: The Information Theory Chain
| Formula | Measures | Range | Key Property | NLP Application |
|---|---|---|---|---|
| Entropy H(P) | Uncertainty in P | [0, log₂V] | Minimum possible encoding cost | Vocabulary richness, text complexity |
| Cross-Entropy H(P,Q) | Cost of encoding P with Q | [H(P), +∞) | Always ≥ Entropy | Transformer training loss |
| KL Divergence | Extra bits wasted by Q | [0, +∞) | Asymmetric; = H(P,Q) − H(P) | Model comparison, variational inference |
| JS Divergence | Symmetric divergence | [0, 1] | Symmetric, bounded, sqrt is a metric | Corpus comparison, GANs |
| Perplexity | Effective vocabulary size | [1, +∞) | = 2^(cross-entropy) | THE language model evaluation metric |
All five formulas are deeply connected. Entropy is the baseline. Cross-entropy adds the cost of using a wrong model. KL divergence isolates that extra cost. JS divergence makes it symmetric and bounded. Perplexity converts cross-entropy to an intuitive "effective choices" scale. Master these relationships and you understand the mathematical heart of language modeling.