PMI and Its Variants
What you'll learn
- How PMI reveals hidden word associations from raw text
- Why raw PMI has problems with rare words, and how each variant fixes them
- The intuition behind logarithms in text analysis
- How to compute PMI by hand and interactively
Introduction
Imagine you're reading a news corpus and you notice that "stock" and "market" appear together far more often than you'd expect if they were independent. Pointwise Mutual Information (PMI) captures exactly this intuition: it measures how much more (or less) two words co-occur compared to what chance alone would predict.
PMI is one of the simplest and most powerful tools in computational linguistics. It's the foundation for collocation extraction, word similarity, and even modern word embeddings (GloVe is essentially a factorized PMI matrix — we'll see this in Chapter 7).
Pointwise Mutual Information (PMI)
Think of it this way: if "ice" and "cream" appear together much more often than you'd expect from how often each appears alone, PMI is a large positive number. If they appear together less than expected, PMI is negative. If they're completely independent, PMI is zero. Formally:
\(\text{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}\)
Worked Example
Given a corpus of 10,000 word pairs, where "ice" appears in 200 pairs, "cream" in 150, and "ice cream" co-occurs in 80:
- Calculate individual probabilities: \(p(\text{ice}) = 200/10000 = 0.02\), \(p(\text{cream}) = 150/10000 = 0.015\)
- Calculate joint probability: \(p(\text{ice, cream}) = 80/10000 = 0.008\)
- Plug into the formula: \(\text{PMI} = \log_2 \frac{0.008}{0.02 \times 0.015} = \log_2 \frac{0.008}{0.0003} = \log_2(26.67) \approx 4.74\) bits
A PMI of 4.74 means "ice cream" is about 27 times more likely to co-occur than chance predicts.
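The worked example above is a three-line computation. A minimal Python sketch of it:

```python
import math

# Counts from the worked example above
n_pairs = 10_000
count_ice, count_cream, count_joint = 200, 150, 80

# Individual and joint probabilities
p_ice = count_ice / n_pairs        # 0.02
p_cream = count_cream / n_pairs    # 0.015
p_joint = count_joint / n_pairs    # 0.008

# PMI = log2 of observed-over-expected co-occurrence
pmi = math.log2(p_joint / (p_ice * p_cream))
print(f"PMI = {pmi:.2f} bits")  # PMI = 4.74 bits
```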
Positive PMI (PPMI)
Negative PMI values are unreliable — they usually just mean "we didn't see enough data." PPMI simply clips negative values to zero, keeping only the meaningful positive associations.
Normalized PMI (NPMI)
Raw PMI has no upper bound — rare co-occurrences can produce arbitrarily high values. NPMI normalizes to a clean [-1, +1] range: -1 means "never co-occur," 0 means "independent," +1 means "always co-occur."
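Both PPMI and NPMI are one-line adjustments on top of the base formula. A minimal sketch (function names are illustrative), reusing the "ice cream" probabilities from the worked example:

```python
import math

def pmi(p_xy, p_x, p_y):
    return math.log2(p_xy / (p_x * p_y))

def ppmi(p_xy, p_x, p_y):
    # Clip negative (unreliable) associations to zero
    return max(pmi(p_xy, p_x, p_y), 0.0)

def npmi(p_xy, p_x, p_y):
    # Dividing by -log2 p(x, y) maps the score into [-1, +1]
    return pmi(p_xy, p_x, p_y) / -math.log2(p_xy)

print(ppmi(0.008, 0.02, 0.015))  # ≈ 4.74 (positive, so unclipped)
print(npmi(0.008, 0.02, 0.015))  # ≈ 0.68 (strong but not perfect association)
```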
PMI² (PMI-squared)
PMI tends to overweight rare events. PMI² counteracts this by squaring the joint probability in the numerator, giving more weight to word pairs that actually co-occur frequently.
PMI^k (Generalized)
PMI^k is the generalization: raise the joint probability to power k. When k=1, you get standard PMI. When k=2, you get PMI². Higher k values increasingly favor frequent co-occurrences over rare ones.
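The generalization fits in one function; k=1 recovers standard PMI and k=2 gives PMI². A sketch using the worked-example probabilities, which shows how k=2 pulls the score of this fairly rare pair down:

```python
import math

def pmi_k(p_xy, p_x, p_y, k=1):
    # PMI^k = log2( p(x,y)^k / (p(x) * p(y)) )
    # Raising p(x, y) to the power k penalizes rare co-occurrences
    return math.log2(p_xy ** k / (p_x * p_y))

print(pmi_k(0.008, 0.02, 0.015, k=1))  # ≈ 4.74 (standard PMI)
print(pmi_k(0.008, 0.02, 0.015, k=2))  # ≈ -2.23 (PMI²)
```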
Shifted PMI (SPMI)
Shifted PMI subtracts a constant, log k, from PMI; the positive version (SPPMI) clips to zero after the shift, which discards the weaker associations that the shift pushes below zero. Factorizing the SPPMI matrix is exactly what Word2Vec with negative sampling implicitly optimizes, with k playing the role of the number of negative samples — connecting this simple formula to neural word embeddings.
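A sketch of SPPMI, assuming k=5 purely as an illustrative shift (in Word2Vec terms, five negative samples):

```python
import math

def sppmi(p_xy, p_x, p_y, k=5):
    # Shift PMI down by log2(k), then clip at zero;
    # k corresponds to the number of negative samples in Word2Vec
    return max(math.log2(p_xy / (p_x * p_y)) - math.log2(k), 0.0)

# Worked-example probabilities: 4.74 bits shifted down by log2(5) ≈ 2.32
print(sppmi(0.008, 0.02, 0.015, k=5))  # ≈ 2.42
```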
Interactive: PMI Calculator
Enter some text below, pick two words, and see all PMI variants computed live.
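If you'd rather follow along outside the widget, here is a minimal sketch of the kind of computation it performs; `pmi_from_text` and its `window` parameter are illustrative choices, counting occurrences of the second word within a small window around each occurrence of the first:

```python
import math
from collections import Counter

def pmi_from_text(text, w1, w2, window=2):
    """Estimate PMI of two words from sliding-window co-occurrence counts."""
    tokens = text.lower().split()
    unigrams = Counter(tokens)
    n = len(tokens)

    # Count w2 appearing within `window` tokens of each w1 occurrence
    pairs = 0
    for i, tok in enumerate(tokens):
        if tok == w1:
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            pairs += context.count(w2)

    if pairs == 0:
        return float("-inf")  # never co-occur in this sample
    p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
    p_xy = pairs / n
    return math.log2(p_xy / (p_x * p_y))

text = "ice cream is cold and ice cream is sweet but ice is just cold"
print(pmi_from_text(text, "ice", "cream"))  # positive: strong association
```

On a corpus this tiny the estimate is noisy, which is precisely why the variants above exist.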
Summary: When to Use Which
| Formula | Range | Rare-Word Bias | Best For |
|---|---|---|---|
| PMI | (-∞, +∞) | High (overweights rare) | Exploratory analysis, understanding associations |
| PPMI | [0, +∞) | High | Word-word matrices for SVD embeddings |
| NPMI | [-1, +1] | Low | Topic coherence evaluation, comparable scores |
| PMI² | (-∞, 0] | Low | Collocation extraction (balances frequency + association) |
| PMI^k | Depends on k | Tunable via k | Experimentation, tuning the frequency-association trade-off |
| SPPMI | [0, +∞) | Low (shifted out) | Neural embedding equivalence, Word2Vec comparison |
All PMI variants measure the same core idea: do these words co-occur more than chance predicts? The variants differ in how they handle rare events, what range they output, and what downstream task they're tuned for.