PMI and Its Variants
What you'll learn
- How PMI reveals hidden word associations from raw text
- Why raw PMI has problems with rare words, and how each variant fixes them
- The intuition behind logarithms in text analysis
- How to compute PMI by hand and interactively
Introduction
Imagine you're reading a news corpus and you notice that "stock" and "market" appear together far more often than you'd expect if they were independent. Pointwise Mutual Information (PMI) captures exactly this intuition: it measures how much more (or less) two words co-occur compared to what chance alone would predict.
PMI is one of the simplest and most powerful tools in computational linguistics. It's the foundation for collocation extraction, word similarity, and even modern word embeddings (GloVe is essentially a factorized PMI matrix — we'll see this in Chapter 7).
Pointwise Mutual Information (PMI)
Think of it this way: if "ice" and "cream" appear together much more often than you'd expect from how often each appears alone, PMI is a large positive number. If they appear together less than expected, PMI is negative. If they're completely independent, PMI is zero. Formally:
\(\text{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}\)
Worked Example
Given a corpus of 10,000 word pairs, where "ice" appears in 200 pairs, "cream" in 150, and "ice cream" co-occurs in 80:
- Calculate individual probabilities: \(p(\text{ice}) = 200/10000 = 0.02\), \(p(\text{cream}) = 150/10000 = 0.015\)
- Calculate joint probability: \(p(\text{ice, cream}) = 80/10000 = 0.008\)
- Plug into the formula: \(\text{PMI} = \log_2 \frac{0.008}{0.02 \times 0.015} = \log_2 \frac{0.008}{0.0003} = \log_2(26.67) \approx 4.74\) bits
A PMI of 4.74 means "ice cream" is about 27 times more likely to co-occur than chance predicts.
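The worked example above is a three-line computation. A minimal Python sketch of it:

```python
import math

# Counts from the worked example above
n_pairs = 10_000
count_ice, count_cream, count_joint = 200, 150, 80

# Individual and joint probabilities
p_ice = count_ice / n_pairs        # 0.02
p_cream = count_cream / n_pairs    # 0.015
p_joint = count_joint / n_pairs    # 0.008

# PMI = log2 of observed-over-expected co-occurrence
pmi = math.log2(p_joint / (p_ice * p_cream))
print(f"PMI = {pmi:.2f} bits")  # PMI = 4.74 bits
```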
Positive PMI (PPMI)
Negative PMI values are unreliable — they usually just mean "we didn't see enough data." PPMI simply clips negative values to zero, keeping only the meaningful positive associations.
Normalized PMI (NPMI)
Raw PMI has no upper bound — rare co-occurrences can produce arbitrarily high values. NPMI normalizes to a clean [-1, +1] range: -1 means "never co-occur," 0 means "independent," +1 means "always co-occur."
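Both PPMI and NPMI are one-line adjustments on top of the base formula. A minimal sketch (function names are illustrative), reusing the "ice cream" probabilities from the worked example:

```python
import math

def pmi(p_xy, p_x, p_y):
    return math.log2(p_xy / (p_x * p_y))

def ppmi(p_xy, p_x, p_y):
    # Clip negative (unreliable) associations to zero
    return max(pmi(p_xy, p_x, p_y), 0.0)

def npmi(p_xy, p_x, p_y):
    # Dividing by -log2 p(x, y) maps the score into [-1, +1]
    return pmi(p_xy, p_x, p_y) / -math.log2(p_xy)

print(ppmi(0.008, 0.02, 0.015))  # ≈ 4.74 (positive, so unclipped)
print(npmi(0.008, 0.02, 0.015))  # ≈ 0.68 (strong but not perfect association)
```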
PMI² (PMI-squared)
PMI tends to overweight rare events. PMI² counteracts this by squaring the joint probability in the numerator, giving more weight to word pairs that actually co-occur frequently.
PMI^k (Generalized)
PMI^k is the generalization: raise the joint probability to power k. When k=1, you get standard PMI. When k=2, you get PMI². Higher k values increasingly favor frequent co-occurrences over rare ones.
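The generalization fits in one function; k=1 recovers standard PMI and k=2 gives PMI². A sketch using the worked-example probabilities, which shows how k=2 pulls the score of this fairly rare pair down:

```python
import math

def pmi_k(p_xy, p_x, p_y, k=1):
    # PMI^k = log2( p(x,y)^k / (p(x) * p(y)) )
    # Raising p(x, y) to the power k penalizes rare co-occurrences
    return math.log2(p_xy ** k / (p_x * p_y))

print(pmi_k(0.008, 0.02, 0.015, k=1))  # ≈ 4.74 (standard PMI)
print(pmi_k(0.008, 0.02, 0.015, k=2))  # ≈ -2.23 (PMI²)
```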
Shifted PMI (SPMI)
Shifted PMI subtracts a constant, log k, from PMI; the positive version (SPPMI) clips to zero after the shift, which discards the weaker associations that the shift pushes below zero. Factorizing the SPPMI matrix is exactly what Word2Vec with negative sampling implicitly optimizes, with k playing the role of the number of negative samples — connecting this simple formula to neural word embeddings.
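A sketch of SPPMI, assuming k=5 purely as an illustrative shift (in Word2Vec terms, five negative samples):

```python
import math

def sppmi(p_xy, p_x, p_y, k=5):
    # Shift PMI down by log2(k), then clip at zero;
    # k corresponds to the number of negative samples in Word2Vec
    return max(math.log2(p_xy / (p_x * p_y)) - math.log2(k), 0.0)

# Worked-example probabilities: 4.74 bits shifted down by log2(5) ≈ 2.32
print(sppmi(0.008, 0.02, 0.015, k=5))  # ≈ 2.42
```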
Interactive: PMI Calculator
Enter some text below, pick two words, and see all PMI variants computed live.
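If you'd rather follow along outside the widget, here is a minimal sketch of the kind of computation it performs; `pmi_from_text` and its `window` parameter are illustrative choices, counting occurrences of the second word within a small window around each occurrence of the first:

```python
import math
from collections import Counter

def pmi_from_text(text, w1, w2, window=2):
    """Estimate PMI of two words from sliding-window co-occurrence counts."""
    tokens = text.lower().split()
    unigrams = Counter(tokens)
    n = len(tokens)

    # Count w2 appearing within `window` tokens of each w1 occurrence
    pairs = 0
    for i, tok in enumerate(tokens):
        if tok == w1:
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            pairs += context.count(w2)

    if pairs == 0:
        return float("-inf")  # never co-occur in this sample
    p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
    p_xy = pairs / n
    return math.log2(p_xy / (p_x * p_y))

text = "ice cream is cold and ice cream is sweet but ice is just cold"
print(pmi_from_text(text, "ice", "cream"))  # positive: strong association
```

On a corpus this tiny the estimate is noisy, which is precisely why the variants above exist.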
Summary: When to Use Which
| Formula | Range | Rare-Word Bias | Best For |
|---|---|---|---|
| PMI | (-∞, +∞) | High (overweights rare) | Exploratory analysis, understanding associations |
| PPMI | [0, +∞) | High | Word-word matrices for SVD embeddings |
| NPMI | [-1, +1] | Low | Topic coherence evaluation, comparable scores |
| PMI² | (-∞, 0] | Low | Collocation extraction (balances frequency + association) |
| PMI^k | Depends on k | Tunable via k | Experimentation, tuning the frequency-association trade-off |
| SPPMI | [0, +∞) | Low (shifted out) | Neural embedding equivalence, Word2Vec comparison |
All PMI variants measure the same core idea: do these words co-occur more than chance predicts? The variants differ in how they handle rare events, what range they output, and what downstream task they're tuned for.