Language Models
What you'll learn
- How n-gram language models assign probabilities to word sequences
- Why smoothing is essential and how different methods redistribute probability mass
- The intuition behind Good-Turing, Kneser-Ney, and Katz back-off
- How to evaluate language models using perplexity
This chapter builds on entropy and perplexity from Chapter 4: Information Theory. Make sure you're comfortable with those concepts before diving in.
Introduction
A language model assigns a probability to every possible sequence of words. Given a sentence like "the cat sat on the," a good language model should predict that "mat" or "floor" are likely next words, while "elephant" or "algorithm" are not. This simple idea is the backbone of spell-checking, machine translation, speech recognition, and modern text generation.
The fundamental question is: how do we estimate \(P(w_1, w_2, \ldots, w_n)\), the probability of a complete word sequence? By the chain rule of probability, we can decompose it:
\(P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})\)
But estimating the full conditional \(P(w_i \mid w_1, \ldots, w_{i-1})\) is hopeless for long sequences -- we'd almost never see the exact same history twice. The Markov assumption saves us: we approximate by looking at only the last \(n-1\) words. This gives us n-gram models, the classical workhorses of NLP.
N-gram Probability
Think of it this way: Instead of asking "what's the probability of this word given the entire preceding history?", we ask "what's the probability of this word given just the last one or two words?" A bigram model looks at one word of context; a trigram model looks at two. It's a simplification, but a remarkably effective one. For a bigram model, the maximum-likelihood estimate is
\(P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}\)
where \(C(\cdot)\) counts occurrences in the training corpus.
Worked Example
Corpus: "the cat sat on the mat . the cat ate the fish ." Compute the bigram probability P(sat | cat):
- Count the bigram "cat sat": \(C(\text{cat sat}) = 1\)
- Count the unigram "cat" (the context): \(C(\text{cat}) = 2\) (appears before "sat" and before "ate")
- Apply the formula: \(P(\text{sat} \mid \text{cat}) = \frac{C(\text{cat sat})}{C(\text{cat})} = \frac{1}{2} = 0.50\)
According to this bigram model, given "cat" the next word is "sat" with 50% probability and "ate" with 50% probability.
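The counting steps above can be reproduced in a few lines of Python (a minimal sketch; function and variable names are our own):

```python
from collections import Counter

# Toy corpus from the worked example, tokenized on whitespace
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count unigrams and adjacent word pairs (bigrams)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(word, context):
    """Maximum-likelihood bigram estimate P(word | context) = C(context word) / C(context)."""
    return bigrams[(context, word)] / unigrams[context]

print(p_mle("sat", "cat"))  # 0.5
print(p_mle("ate", "cat"))  # 0.5
```

Note that any bigram absent from the corpus gets probability exactly zero here -- the problem the smoothing methods below are designed to fix.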
Laplace (Add-One) Smoothing
Think of it this way: Pretend you saw every possible bigram at least once, even the ones that never actually appeared. You add 1 to every count, then re-normalize. It guarantees nothing has zero probability, but it steals a lot of probability mass from common events and redistributes it to rare ones.
Worked Example
Suppose "the" appears 100 times as a bigram context, "the cat" appears 8 times, and the vocabulary size is |V| = 500:
- Without smoothing: \(P(\text{cat} \mid \text{the}) = \frac{8}{100} = 0.080\)
- With Laplace smoothing: \(P_{\text{Laplace}}(\text{cat} \mid \text{the}) = \frac{8 + 1}{100 + 500} = \frac{9}{600} = 0.015\)
- A word never seen after "the" (count = 0): \(P_{\text{Laplace}}(\text{unicorn} \mid \text{the}) = \frac{0 + 1}{100 + 500} = \frac{1}{600} \approx 0.0017\)
Notice how "the cat" dropped from 8.0% to 1.5%. Laplace smoothing is very aggressive -- it drastically reshapes the distribution.
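The add-one formula is a one-liner; this sketch (our own function name) reproduces the numbers from the worked example:

```python
def p_laplace(bigram_count, context_count, vocab_size):
    """Add-one (Laplace) smoothed bigram probability:
    P = (C(context word) + 1) / (C(context) + |V|)."""
    return (bigram_count + 1) / (context_count + vocab_size)

# Numbers from the worked example: C(the cat) = 8, C(the) = 100, |V| = 500
print(p_laplace(8, 100, 500))  # 0.015
print(p_laplace(0, 100, 500))  # 1/600, about 0.0017
```

Replacing the added 1 with a fraction \(k < 1\) gives add-k smoothing, a common way to soften Laplace's aggressiveness.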
Good-Turing Smoothing
Think of it this way: Imagine you're a birdwatcher who has seen 10 species exactly once. If someone asks "how likely is it that the next bird you see is a new species?", Good-Turing says: look at how many species you saw exactly once. That count tells you something about how many species you haven't seen at all. The frequency of frequencies is the key insight.
Worked Example
Suppose in our bigram counts we have the following frequency-of-frequencies:
- \(N_0\) = 74 bigrams never seen
- \(N_1\) = 10 bigrams seen exactly once
- \(N_2\) = 5 bigrams seen exactly twice
- \(N_3\) = 3 bigrams seen exactly three times
Applying the adjusted-count formula \(c^* = (c+1)\,\frac{N_{c+1}}{N_c}\):
- Adjust the count for bigrams seen once (\(c=1\)): \(c^* = (1+1) \times \frac{N_2}{N_1} = 2 \times \frac{5}{10} = 1.0\)
- Adjust the count for bigrams seen twice (\(c=2\)): \(c^* = (2+1) \times \frac{N_3}{N_2} = 3 \times \frac{3}{5} = 1.8\)
- Total probability mass for unseen bigrams (\(c=0\)): \(P_0 = \frac{N_1}{N} = \frac{10}{N}\), where \(N\) is the total count of all bigram tokens.
Bigrams seen twice are discounted from 2 to 1.8, while those seen once happen to keep an adjusted count of 1.0 in this example (usually \(c^*\) comes out below \(c\)); the "saved" mass, \(N_1/N\), goes to unseen events.
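The adjusted-count computation can be sketched directly from the frequency-of-frequencies table (a minimal sketch; the function name is our own, and it assumes \(N_{c+1}\) is available, which real implementations must smooth for high counts):

```python
def gt_adjusted_count(c, freq_of_freq):
    """Good-Turing adjusted count: c* = (c + 1) * N_{c+1} / N_c."""
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

# Frequency-of-frequencies from the example: N_1 = 10, N_2 = 5, N_3 = 3
N = {1: 10, 2: 5, 3: 3}
print(gt_adjusted_count(1, N))  # 1.0
print(gt_adjusted_count(2, N))  # 1.8
```

The instability the summary table mentions is visible here: for large \(c\), \(N_{c+1}\) is often zero or tiny, so practical systems fit a smooth curve to the \(N_c\) values before applying the formula.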
Kneser-Ney Smoothing
Think of it this way: Consider the word "Francisco." It's very frequent, but only after "San." If we back off to the unigram probability, we'd predict "Francisco" as a common next word everywhere -- which is wrong. Kneser-Ney fixes this: instead of raw frequency, it asks "in how many different contexts does this word appear?" The continuation probability measures versatility, not raw frequency.
Worked Example
Bigram counts after "the": "the cat" = 8, "the dog" = 5, "the fish" = 3. Discount D = 0.75. Suppose the continuation probability \(P_{\text{cont}}(\text{cat}) = 0.12\).
- Discounted first term for "cat": \(\frac{\max(8 - 0.75,\, 0)}{8 + 5 + 3} = \frac{7.25}{16} \approx 0.453\)
- Compute \(\lambda(\text{the})\): three word types follow "the", so \(\lambda(\text{the}) = \frac{0.75}{16} \times 3 \approx 0.141\)
- Combine the discounted term and the back-off term: \(P_{\text{KN}}(\text{cat} \mid \text{the}) = 0.453 + 0.141 \times 0.12 \approx 0.470\)
The probability is slightly lower than MLE (0.50) because some mass has been redistributed to unseen continuations.
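The interpolation step can be sketched as follows (our own function name; for simplicity it takes the continuation probability as a given input rather than computing it from context counts):

```python
def p_kn(bigram_count, context_total, num_types, p_cont, D=0.75):
    """Interpolated Kneser-Ney bigram probability with a single
    absolute discount D:
      P_KN(w | ctx) = max(c - D, 0) / total  +  lambda(ctx) * P_cont(w)
    where lambda(ctx) = (D / total) * (number of word types after ctx)."""
    discounted = max(bigram_count - D, 0) / context_total
    lam = (D / context_total) * num_types  # mass reserved for back-off
    return discounted + lam * p_cont

# Worked example: "the cat" = 8 out of 16 tokens after "the",
# 3 word types follow "the", Pcont(cat) = 0.12 (given)
print(round(p_kn(8, 16, 3, 0.12), 3))  # 0.47
```

In a full implementation, \(P_{\text{cont}}(w)\) is the number of distinct contexts \(w\) appears in, divided by the total number of distinct bigram types.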
Katz Back-off
Think of it this way: If you've seen "cat sat" enough times, trust that bigram count (with a Good-Turing discount). If you've never seen "cat danced," don't give it zero probability -- instead, "back off" to asking "how common is 'danced' by itself?" You use the best evidence available, falling back to less specific models only when you must.
Worked Example
Suppose C(cat) = 10 with bigram counts "cat sat" = 4, "cat ate" = 3, "cat watched" = 2, "cat chased" = 1. Use a Good-Turing discount d = 0.9 for all seen bigrams, and let the unigram probability P(danced) = 0.002.
- For a seen bigram "cat sat": \(P_{\text{Katz}}(\text{sat} \mid \text{cat}) = 0.9 \times \frac{4}{10} = 0.36\)
- Total discounted mass for seen bigrams: \(0.9 \times \frac{4 + 3 + 2 + 1}{10} = 0.9 \times 1.0 = 0.90\)
- Leftover mass for back-off: \(1 - 0.90 = 0.10\)
- For an unseen bigram "cat danced": \(P_{\text{Katz}}(\text{danced} \mid \text{cat}) = \alpha(\text{cat}) \times P(\text{danced})\), where \(\alpha(\text{cat})\) distributes the 0.10 leftover mass proportionally to the unigram probabilities of unseen words.
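The discounting and leftover-mass computation can be sketched as below (our own names; computing \(\alpha\) itself is omitted because it requires unigram probabilities for the entire vocabulary of unseen continuations):

```python
def katz_seen_mass(bigram_counts, context_count, d=0.9):
    """Discounted probabilities for seen bigrams, plus the leftover
    probability mass reserved for backing off to unigrams."""
    seen = {w: d * c / context_count for w, c in bigram_counts.items()}
    leftover = 1.0 - sum(seen.values())
    return seen, leftover

# Counts after "cat" from the worked example, with C(cat) = 10
seen, leftover = katz_seen_mass(
    {"sat": 4, "ate": 3, "watched": 2, "chased": 1}, 10)
print(seen["sat"])         # 0.36
print(round(leftover, 2))  # 0.1
```

An unseen continuation like "danced" would then receive `leftover` scaled by its unigram probability, normalized over all unseen words.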
Interactive: N-gram Predictor with Smoothing
Enter a training corpus below, type a context (one or two words), and see how different smoothing methods change the predicted next-word probabilities. The demo builds a bigram language model and shows the top predictions alongside the model's perplexity.
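The perplexity figure the demo reports can be computed from per-token probabilities as in this sketch (our own function name), using the definition from Chapter 4: perplexity is the exponentiated average negative log-probability:

```python
import math

def perplexity(token_probs):
    """Perplexity of a test sequence given the model's probability
    for each token: PP = exp(-(1/N) * sum(log p_i))."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Sanity check: a model that is uniform over 4 choices at every step
# has perplexity 4 -- it is "as confused as" a 4-way coin flip.
print(round(perplexity([0.25, 0.25, 0.25]), 6))  # 4.0
```

This is also why smoothing matters for evaluation: a single zero-probability test token makes the log undefined and the perplexity infinite.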
Summary: Smoothing Methods Compared
| Method | Key Idea | Strengths | Weaknesses |
|---|---|---|---|
| No Smoothing (MLE) | Use raw counts directly | Simple, unbiased for seen events | Zero probability for unseen n-grams |
| Laplace (Add-One) | Add 1 to all counts | Simple, no zeros | Moves too much mass to unseen events, poor perplexity |
| Good-Turing | Use frequency of frequencies | Data-driven redistribution, elegant theory | Unstable for high counts, doesn't handle back-off |
| Katz Back-off | Good-Turing discount + hierarchical back-off | Principled combination of discounting and back-off | Back-off uses raw unigram (not context-aware) |
| Kneser-Ney | Absolute discount + continuation probability | Best perplexity, captures word versatility | More complex to implement |
All smoothing methods solve the same fundamental problem: what probability should we assign to events we've never seen? The methods differ in how cleverly they redistribute mass from seen events to unseen ones. Kneser-Ney's insight -- using continuation probability instead of raw frequency for back-off -- makes it the best classical approach. Modern neural language models learn their own implicit "smoothing" through distributed representations, which we'll explore in Chapter 7.