Chapter 06

Language Models

5 formulas · Intermediate

What you'll learn

  • How n-gram language models assign probabilities to word sequences
  • Why smoothing is essential and how different methods redistribute probability mass
  • The intuition behind Good-Turing, Kneser-Ney, and Katz back-off
  • How to evaluate language models using perplexity
Prerequisites

This chapter builds on entropy and perplexity from Chapter 4: Information Theory. Make sure you're comfortable with those concepts before diving in.

Introduction

A language model assigns a probability to every possible sequence of words. Given a sentence like "the cat sat on the," a good language model should predict that "mat" or "floor" are likely next words, while "elephant" or "algorithm" are not. This simple idea is the backbone of spell-checking, machine translation, speech recognition, and modern text generation.

The fundamental question is: how do we estimate \(P(w_1, w_2, \ldots, w_n)\), the probability of a complete word sequence? By the chain rule of probability, we can decompose it:

$$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$

But estimating the full conditional \(P(w_i \mid w_1, \ldots, w_{i-1})\) is impossible for long sequences -- we'd never see the exact same history twice. The Markov assumption saves us: we approximate by looking at only the last \(k{-}1\) words. This gives us n-gram models, the classical workhorses of NLP.
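Concretely, for a bigram model (\(k = 2\)) the chain rule simplifies to conditioning on just the previous word, with \(w_0\) a start-of-sentence marker:

$$P(w_1, w_2, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$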

N-gram Probability

Think of it this way: Instead of asking "what's the probability of this word given the entire preceding history?", we ask "what's the probability of this word given just the last one or two words?" A bigram model looks at one word of context. A trigram looks at two. It's a simplification, but a remarkably effective one.

N-gram Probability (MLE)
$$P(\textcolor{#0284c7}{w_n} \mid \textcolor{#e11d48}{w_{n-k+1}}, \ldots, \textcolor{#e11d48}{w_{n-1}}) = \frac{\textcolor{#059669}{C(w_{n-k+1}, \ldots, w_n)}}{\textcolor{#e11d48}{C(w_{n-k+1}, \ldots, w_{n-1})}}$$
\(C(w_{n-k+1}, \ldots, w_n)\)  Count of the full n-gram in the training corpus
\(C(w_{n-k+1}, \ldots, w_{n-1})\)  Count of the (n−1)-gram context (the history without the predicted word)
\(k\)  The n-gram order: \(k=1\) for unigram, \(k=2\) for bigram, \(k=3\) for trigram

Worked Example

Corpus: "the cat sat on the mat . the cat ate the fish ." Compute the bigram probability \(P(\text{sat} \mid \text{cat})\):

  1. Count bigram "cat sat":
    \(C(\text{cat sat}) = 1\)
  2. Count unigram "cat" (the context):
    \(C(\text{cat}) = 2\)   (appears before "sat" and before "ate")
  3. Apply the formula:
    \(P(\text{sat} \mid \text{cat}) = \frac{C(\text{cat sat})}{C(\text{cat})} = \frac{1}{2}\) = 0.50

According to this bigram model, given "cat" the next word is "sat" with 50% probability and "ate" with 50% probability.
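The counting steps above can be sketched in a few lines of Python (a minimal illustration, assuming whitespace tokenization of the toy corpus; the function name is ours):

```python
from collections import Counter

# Toy corpus from the worked example; tokens are whitespace-separated.
tokens = "the cat sat on the mat . the cat ate the fish .".split()

# Count unigrams and adjacent-pair bigrams with sliding windows.
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_mle(word, context):
    """Maximum-likelihood bigram probability P(word | context)."""
    return bigrams[(context, word)] / unigrams[context]

print(p_mle("sat", "cat"))  # C(cat sat) / C(cat) = 1/2 = 0.5
```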

The sparsity problem: Maximum likelihood estimation (MLE) assigns zero probability to any n-gram not seen in training. If "cat danced" never appeared, \(P(\text{danced} \mid \text{cat}) = 0\). This is catastrophic -- a single unseen bigram makes the probability of the entire sentence zero. Smoothing methods (below) fix this.

Laplace (Add-One) Smoothing

Think of it this way: Pretend you saw every possible bigram at least once, even the ones that never actually appeared. You add 1 to every count, then re-normalize. It guarantees nothing has zero probability, but it steals a lot of probability mass from common events and redistributes it to rare ones.

Laplace (Add-One) Smoothing
$$P_{\text{Laplace}}(\textcolor{#0284c7}{w} \mid \textcolor{#e11d48}{\text{context}}) = \frac{\textcolor{#059669}{C(w, \text{context})} + \textcolor{#d97706}{1}}{\textcolor{#e11d48}{C(\text{context})} + \textcolor{#d97706}{|V|}}$$
C(w, context) Count of the word following this context in the training data
C(context) Total count of this context (how many times the history appeared)
|V| Vocabulary size (number of unique word types). Adding 1 to each of |V| items means adding |V| to the denominator

Worked Example

Suppose "the" appears 100 times as a bigram context, "the cat" appears 8 times, and |V| = 500:

  1. Without smoothing:
    \(P(\text{cat} \mid \text{the}) = \frac{8}{100} = 0.080\)
  2. With Laplace smoothing:
    \(P_{\text{Laplace}}(\text{cat} \mid \text{the}) = \frac{8 + 1}{100 + 500} = \frac{9}{600}\) = 0.015
  3. A word never seen after "the" (count = 0):
    \(P_{\text{Laplace}}(\text{unicorn} \mid \text{the}) = \frac{0 + 1}{100 + 500} = \frac{1}{600}\) = 0.0017

Notice how "the cat" dropped from 8.0% to 1.5%. Laplace smoothing is very aggressive -- it drastically reshapes the distribution.

Simple but flawed: Laplace smoothing moves far too much probability mass to unseen events. With a typical NLP vocabulary of 50,000+ words, most of the denominator comes from the |V| term, swamping the actual counts. It's useful for teaching and for tasks with small vocabularies (like text classification) but is rarely used for language models in practice.
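The reshaping is easy to verify in code (a sketch using the example's numbers; the function name is illustrative):

```python
def p_laplace(c_bigram, c_context, V):
    """Add-one estimate: (C(w, context) + 1) / (C(context) + |V|)."""
    return (c_bigram + 1) / (c_context + V)

# Numbers from the worked example: C(the cat) = 8, C(the) = 100, |V| = 500
print(p_laplace(8, 100, 500))  # 9/600  = 0.015
print(p_laplace(0, 100, 500))  # 1/600 ≈ 0.0017
```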

Good-Turing Smoothing

Think of it this way: Imagine you're a birdwatcher who has seen 10 species exactly once. If someone asks "how likely is it that the next bird you see is a new species?", Good-Turing says: look at how many species you saw exactly once. That count tells you something about how many species you haven't seen at all. The frequency of frequencies is the key insight.

Good-Turing Adjusted Count
$$\textcolor{#0284c7}{c^*} = (\textcolor{#059669}{c} + 1) \times \frac{\textcolor{#e11d48}{N_{c+1}}}{\textcolor{#e11d48}{N_c}}$$
c The original count of an n-gram in the corpus
c* The adjusted (smoothed) count
\(N_c\)  The number of n-grams that occur exactly c times (frequency of frequency c)
\(N_{c+1}\)  The number of n-grams that occur exactly c+1 times

Worked Example

Suppose in our bigram counts we have the following frequency-of-frequencies:

  • \(N_0\) = 74 bigrams never seen
  • \(N_1\) = 10 bigrams seen exactly once
  • \(N_2\) = 5 bigrams seen exactly twice
  • \(N_3\) = 3 bigrams seen exactly three times
  1. Adjust count for bigrams seen once (\(c=1\)):
    \(c^* = (1+1) \times \frac{N_2}{N_1} = 2 \times \frac{5}{10}\) = 1.0
  2. Adjust count for bigrams seen twice (\(c=2\)):
    \(c^* = (2+1) \times \frac{N_3}{N_2} = 3 \times \frac{3}{5}\) = 1.8
  3. Total probability mass for unseen bigrams (\(c=0\)):
    \(P_0 = \frac{N_1}{N} = \frac{10}{N}\)   where N is the total count of all bigram tokens.

In this example, bigrams seen once keep an adjusted count of exactly 1.0 (because \(N_2/N_1 = 0.5\)); in real corpora the adjusted count usually falls below the raw count, and the "saved" mass goes to unseen events.

vs. Laplace: Good-Turing uses the data itself to decide how much mass to redistribute. It doesn't blindly add the same constant to every count. The intuition is beautifully simple: the number of things you saw once tells you about the number of things you haven't seen at all.
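The adjustment can be read straight off the frequency-of-frequencies table (illustrative Python, using the example's counts):

```python
def good_turing(c, N):
    """Good-Turing adjusted count: c* = (c + 1) * N_{c+1} / N_c."""
    return (c + 1) * N[c + 1] / N[c]

# Frequency-of-frequencies from the worked example above.
N = {0: 74, 1: 10, 2: 5, 3: 3}
print(good_turing(1, N))  # 2 * 5/10 = 1.0
print(good_turing(2, N))  # 3 * 3/5  = 1.8
# Mass reserved for all unseen bigrams: P_0 = N[1] / total_bigram_tokens
```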

Kneser-Ney Smoothing

Think of it this way: Consider the word "Francisco." It's very frequent, but only after "San." If we back off to the unigram probability, we'd predict "Francisco" as a common next word everywhere -- which is wrong. Kneser-Ney fixes this: instead of raw frequency, it asks "in how many different contexts does this word appear?" The continuation probability measures versatility, not raw frequency.

Kneser-Ney Smoothing
$$P_{\text{KN}}(\textcolor{#0284c7}{w} \mid \textcolor{#e11d48}{w_{i-1}}) = \frac{\max(\textcolor{#059669}{C(w_{i-1}, w)} - \textcolor{#d97706}{D},\, 0)}{\textcolor{#e11d48}{C(w_{i-1})}} + \textcolor{#7c3aed}{\lambda}(w_{i-1}) \cdot \textcolor{#0891b2}{P_{\text{cont}}}(w)$$
D Absolute discount (typically 0.75). A fixed amount subtracted from every nonzero count
\(\lambda(w_{i-1})\)  Normalizing weight that ensures probabilities sum to 1. Equals \(\frac{D}{C(w_{i-1})} \times |\{w : C(w_{i-1}, w) > 0\}|\)
\(P_{\text{cont}}(w)\)  Continuation probability: proportion of unique contexts w appears in. \(P_{\text{cont}}(w) = \frac{|\{v : C(v, w) > 0\}|}{\sum_{w'} |\{v : C(v, w') > 0\}|}\)

Worked Example

Bigram counts after "the": "the cat" = 8, "the dog" = 5, "the fish" = 3. \(D = 0.75\). Suppose \(P_{\text{cont}}(\text{cat}) = 0.12\).

  1. Discounted first term for "cat":
    \(\frac{\max(8 - 0.75,\, 0)}{8 + 5 + 3} = \frac{7.25}{16}\) = 0.453
  2. Compute \(\lambda(\text{the})\):
    Three word types follow "the", so \(\lambda = \frac{0.75}{16} \times 3 = 0.141\)
  3. Combine discounted term + back-off:
    \(P_{\text{KN}}(\text{cat} \mid \text{the}) = 0.453 + 0.141 \times 0.12\) = 0.470

The probability is slightly lower than MLE (0.50) because some mass has been redistributed to unseen continuations.

The gold standard: Kneser-Ney (and its modified variant by Chen & Goodman, 1999) consistently outperforms all other classical smoothing methods in perplexity evaluations. The continuation probability is the key innovation -- it captures versatility rather than raw frequency.
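The three steps of the worked example can be sketched as follows (a minimal illustration for a single context; the continuation probability is taken as given, as in the example):

```python
def p_kneser_ney(word, counts_after, D, p_cont):
    """Interpolated Kneser-Ney probability for one bigram context.

    counts_after maps next-word -> C(context, word); p_cont is the
    continuation probability of `word` (0.12 in the worked example)."""
    total = sum(counts_after.values())
    # Discounted first term: max(C - D, 0) / C(context)
    discounted = max(counts_after.get(word, 0) - D, 0) / total
    # lambda(context): spreads the subtracted mass via the continuation distribution
    lam = (D / total) * sum(1 for c in counts_after.values() if c > 0)
    return discounted + lam * p_cont

counts_after_the = {"cat": 8, "dog": 5, "fish": 3}
print(p_kneser_ney("cat", counts_after_the, D=0.75, p_cont=0.12))  # ≈ 0.47
```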

Katz Back-off

Think of it this way: If you've seen "cat sat" enough times, trust that bigram count (with a Good-Turing discount). If you've never seen "cat danced," don't give it zero probability -- instead, "back off" to asking "how common is 'danced' by itself?" You use the best evidence available, falling back to less specific models only when you must.

Katz Back-off
$$P_{\text{Katz}}(\textcolor{#0284c7}{w_n} \mid \textcolor{#e11d48}{w_{n-1}}) = \begin{cases} \textcolor{#059669}{d_c} \cdot \dfrac{C(w_{n-1}, w_n)}{C(w_{n-1})} & \text{if } C(w_{n-1}, w_n) > 0 \\[8pt] \textcolor{#7c3aed}{\alpha}(w_{n-1}) \cdot P(w_n) & \text{if } C(w_{n-1}, w_n) = 0 \end{cases}$$
\(d_c\)  Good-Turing discount ratio for count c: \(d_c = c^*/c\), where \(c^*\) is the Good-Turing adjusted count
\(\alpha(w_{n-1})\)  Back-off weight. Distributes the leftover probability mass (from discounting) among unseen bigrams
\(P(w_n)\)  Lower-order (unigram) probability, used as the back-off distribution

Worked Example

Suppose C(cat) = 10. "cat sat" = 4, "cat ate" = 3, "cat watched" = 2, "cat chased" = 1. Good-Turing discount d = 0.9 for all. P(danced) = 0.002.

  1. For a seen bigram "cat sat":
    \(P_{\text{Katz}}(\text{sat} \mid \text{cat}) = 0.9 \times \frac{4}{10}\) = 0.36
  2. Total discounted mass for seen bigrams:
    \(0.9 \times \frac{4 + 3 + 2 + 1}{10} = 0.9 \times 1.0 = 0.90\)
  3. Leftover mass for back-off: \(1 - 0.90 = 0.10\)
    For an unseen bigram "cat danced":
    \(P_{\text{Katz}}(\text{danced} \mid \text{cat}) = \alpha(\text{cat}) \times P(\text{danced})\)
    where \(\alpha\) distributes the 0.10 leftover proportionally among unseen words.
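The two-case computation can be sketched with the example's numbers (the back-off weight \(\alpha\) and the unigram probabilities are assumed to be supplied elsewhere):

```python
# Bigram counts after "cat", from the worked example; C(cat) = 10.
after_cat = {"sat": 4, "ate": 3, "watched": 2, "chased": 1}
C_cat = 10
d = 0.9  # Good-Turing discount ratio d_c (given as 0.9 for all counts here)

def p_katz(word, alpha=0.0, p_unigram=0.0):
    """Discounted MLE if the bigram was seen, else back off to alpha * P(word)."""
    c = after_cat.get(word, 0)
    if c > 0:
        return d * c / C_cat
    return alpha * p_unigram

seen_mass = sum(d * c / C_cat for c in after_cat.values())
leftover = 1.0 - seen_mass  # mass reserved for unseen continuations
print(round(p_katz("sat"), 2), round(leftover, 2))  # 0.36 0.1
```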

vs. Kneser-Ney: Both combine discounting with a back-off distribution. The key difference is what they back off to: Katz uses the raw lower-order probability \(P(w)\), while Kneser-Ney uses the continuation probability \(P_{\text{cont}}(w)\). Kneser-Ney's approach is more principled and performs better.

Interactive: N-gram Predictor with Smoothing

Enter a training corpus, type a context (one or two words), and see how different smoothing methods change the predicted next-word probabilities. The demo builds a bigram language model and shows the top predictions alongside the model's perplexity.
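The core logic of such a demo can be sketched as a small add-one-smoothed bigram model (illustrative Python; the function names and toy corpus are ours, and a real demo would also handle unknown words and the other smoothing methods):

```python
import math
from collections import Counter

def train_bigram(text):
    """Unigram counts, bigram counts, and vocabulary from whitespace tokens."""
    toks = text.split()
    return Counter(toks), Counter(zip(toks, toks[1:])), set(toks)

def p_addone(prev, w, uni, bi, vocab):
    """Add-one-smoothed bigram probability P(w | prev)."""
    return (bi[(prev, w)] + 1) / (uni[prev] + len(vocab))

def predict(context, uni, bi, vocab, top=3):
    """Top next-word candidates after `context`, highest probability first."""
    scored = [(w, p_addone(context, w, uni, bi, vocab)) for w in vocab]
    return sorted(scored, key=lambda kv: -kv[1])[:top]

def perplexity(text, uni, bi, vocab):
    """Perplexity of `text` under the smoothed bigram model."""
    toks = text.split()
    logp = sum(math.log2(p_addone(p, w, uni, bi, vocab))
               for p, w in zip(toks, toks[1:]))
    return 2 ** (-logp / (len(toks) - 1))

uni, bi, vocab = train_bigram("the cat sat on the mat . the cat ate the fish .")
print(predict("cat", uni, bi, vocab, top=2))
print(round(perplexity("the cat sat", uni, bi, vocab), 3))  # 4.472
```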

Summary: Smoothing Methods Compared

| Method | Key Idea | Strengths | Weaknesses |
|---|---|---|---|
| No Smoothing (MLE) | Use raw counts directly | Simple, unbiased for seen events | Zero probability for unseen n-grams |
| Laplace (Add-One) | Add 1 to all counts | Simple, no zeros | Moves too much mass to unseen events, poor perplexity |
| Good-Turing | Use frequency of frequencies | Data-driven redistribution, elegant theory | Unstable for high counts, doesn't handle back-off |
| Katz Back-off | Good-Turing discount + hierarchical back-off | Principled combination of discounting and back-off | Back-off uses raw unigram (not context-aware) |
| Kneser-Ney | Absolute discount + continuation probability | Best perplexity, captures word versatility | More complex to implement |

Key Takeaway

All smoothing methods solve the same fundamental problem: what probability should we assign to events we've never seen? The methods differ in how cleverly they redistribute mass from seen events to unseen ones. Kneser-Ney's insight -- using continuation probability instead of raw frequency for back-off -- makes it the best classical approach. Modern neural language models learn their own implicit "smoothing" through distributed representations, which we'll explore in Chapter 7.