Co-occurrence & Association Measures
What you'll learn
- How to build and interpret a 2×2 contingency table for word co-occurrence
- Nine different ways to measure whether two words are associated
- Why different measures disagree and when to trust each one
- How to compare association measures side-by-side on real text
This chapter builds on Chapter 1: PMI and Its Variants. Make sure you understand basic PMI before diving into these broader association measures.
Introduction
In Chapter 1, we measured word association using PMI and its variants. But PMI is just one member of a large family of association measures — statistical tools that answer the question: do these two words co-occur more than chance predicts?
Different measures approach this question from different angles. Some use information theory (MI, LLR), others use classical hypothesis testing (χ², t-score, z-score, Fisher's exact), and still others use set-theoretic overlap (Dice, Jaccard, logDice). Each has strengths: some handle sparse data better, some are more interpretable, and some are more stable across corpus sizes.
The foundation for most of these measures is the 2×2 contingency table, which counts how often two words appear together, separately, and not at all within some unit (a sentence, a document, or a co-occurrence window).
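Building this table can be sketched in a few lines of Python. This is a minimal illustration, assuming lowercase whitespace tokenization and sentence-level co-occurrence; the function name and tokenization choice are ours, not from any particular library:

```python
def contingency_table(sentences, x, y):
    """Build a 2x2 contingency table (a, b, c, d) for words x and y.

    a = sentences containing both x and y
    b = sentences containing x but not y
    c = sentences containing y but not x
    d = sentences containing neither
    """
    a = b = c = d = 0
    for sent in sentences:
        tokens = set(sent.lower().split())   # crude tokenizer: split on whitespace
        has_x, has_y = x in tokens, y in tokens
        if has_x and has_y:
            a += 1
        elif has_x:
            b += 1
        elif has_y:
            c += 1
        else:
            d += 1
    return a, b, c, d
```

All of the measures in this chapter take these four counts (or a subset of them) as input.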
Mutual Information (MI)
Think of it this way: While PMI measures the association between a specific pair of words, Mutual Information averages this across all possible outcomes in the contingency table. It asks: how much does knowing whether word X is present tell you about whether word Y is present?
Worked Example
From 100 sentences: 15 have both "stock" and "market," 10 have only "stock," 5 have only "market," 70 have neither (a=15, b=10, c=5, d=70, N=100).
1. Calculate all joint probabilities:
   \(p(\text{both}) = 15/100 = 0.15\), \(p(\text{stock only}) = 0.10\), \(p(\text{market only}) = 0.05\), \(p(\text{neither}) = 0.70\)
2. Calculate marginals:
   \(p(\text{stock}) = (15+10)/100 = 0.25\), \(p(\text{market}) = (15+5)/100 = 0.20\)
3. Sum over all four cells:
   \(\text{MI} = 0.15 \log_2\!\frac{0.15}{0.25 \times 0.20} + 0.10 \log_2\!\frac{0.10}{0.25 \times 0.80} + \ldots\) ≈ 0.214 bits
An MI of 0.214 bits means knowing one word reduces your uncertainty about the other by about 0.214 bits.
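The four-cell sum above can be checked with a short function. A sketch assuming the (a, b, c, d) cell convention from the worked example:

```python
import math

def mutual_information(a, b, c, d):
    """MI of a 2x2 contingency table, in bits: the sum over all four
    cells of p(x,y) * log2(p(x,y) / (p(x) * p(y)))."""
    n = a + b + c + d
    px, py = (a + b) / n, (a + c) / n          # marginal probabilities
    cells = [
        (a / n, px * py),                      # both present
        (b / n, px * (1 - py)),                # x only
        (c / n, (1 - px) * py),                # y only
        (d / n, (1 - px) * (1 - py)),          # neither
    ]
    # skip empty cells: lim p->0 of p*log(p) is 0
    return sum(p * math.log2(p / q) for p, q in cells if p > 0)
```

For the worked example, `mutual_information(15, 10, 5, 70)` returns roughly 0.214 bits.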
Log-Likelihood Ratio (LLR)
Think of it this way: The Log-Likelihood Ratio asks "how surprised should we be by this contingency table?" It compares two models — one where the words are independent and one where they're not — and measures how much better the non-independent model fits the data.
Worked Example
Using the same table: a=15, b=10, c=5, d=70, N=100.
1. Calculate expected values under independence:
   \(E_a = 25 \times 20 / 100 = 5.0\), \(E_b = 25 \times 80 / 100 = 20.0\),
   \(E_c = 75 \times 20 / 100 = 15.0\), \(E_d = 75 \times 80 / 100 = 60.0\)
2. Sum O × ln(O/E) for each cell:
   \(15 \ln(15/5) + 10 \ln(10/20) + 5 \ln(5/15) + 70 \ln(70/60)\)
3. Multiply by 2:
   \(G^2 = 2 \times (16.48 - 6.93 - 5.49 + 10.79)\) ≈ 29.70
With 1 degree of freedom, G² = 29.70 far exceeds the critical value of 3.84 (p < 0.05), confirming a significant association.
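The three steps above translate directly into code; a minimal sketch using the same (a, b, c, d) convention:

```python
import math

def log_likelihood_ratio(a, b, c, d):
    """G^2 = 2 * sum of O * ln(O/E) over the four cells, where
    E = row_total * column_total / N under independence."""
    n = a + b + c + d
    observed = (a, b, c, d)
    expected = ((a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n)
    # skip empty cells: 0 * ln(0/E) contributes nothing
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)
```

`log_likelihood_ratio(15, 10, 5, 70)` reproduces the G² ≈ 29.70 from the worked example.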
Chi-Squared Test (χ²)
Think of it this way: For each cell in the table, ask: "How far off was the observed count from what we'd expect under independence?" Square those differences (so negative and positive deviations both count), divide by what was expected (to normalize), and add them all up.
Worked Example
Continuing with a=15, b=10, c=5, d=70, N=100.
1. Compute the cross-product difference:
   \(ad - bc = (15)(70) - (10)(5) = 1050 - 50 = 1000\)
2. Compute the denominator (the product of the four marginal totals):
   \((25)(75)(20)(80) = 3{,}000{,}000\)
3. Plug into the formula:
   \(\chi^2 = \frac{100 \times 1000^2}{3{,}000{,}000} = \frac{100{,}000{,}000}{3{,}000{,}000}\) ≈ 33.33
χ² = 33.33 with 1 degree of freedom gives a p-value well below 0.001 — a highly significant association.
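For a 2×2 table the cross-product shortcut used above avoids computing each cell's expected value separately; a sketch:

```python
def chi_squared(a, b, c, d):
    """Pearson chi-squared for a 2x2 table via the shortcut formula:
    N * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator
```

The shortcut is algebraically identical to summing (O − E)²/E over the four cells, so `chi_squared(15, 10, 5, 70)` gives the same 33.33 as the worked example.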
Dice Coefficient
Think of it this way: Treat each word's set of sentences as a bag. Dice measures the overlap: how much do these two bags share? It's the same formula as the F1 score in machine learning — the harmonic mean of precision and recall if you treat one word as "predictions" and the other as "true labels."
Worked Example
Using a=15, b=10, c=5.
1. Count overlapping and total sentences:
   Overlap = 15, |X| = 25, |Y| = 20
2. Apply the formula:
   \(\text{Dice} = \frac{2 \times 15}{25 + 20} = \frac{30}{45}\) = 0.667
A Dice of 0.667 means two-thirds overlap — a strong association in set-overlap terms.
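In terms of the contingency table, |X ∩ Y| = a, |X| = a + b, and |Y| = a + c, so Dice needs only three of the four cells; a one-line sketch:

```python
def dice(a, b, c):
    """Dice = 2 * |X ∩ Y| / (|X| + |Y|), from contingency-table cells:
    intersection = a, |X| = a + b, |Y| = a + c."""
    return 2 * a / ((a + b) + (a + c))
```

Note that the d cell (sentences with neither word) plays no role, which is why Dice is insensitive to corpus size in a way the hypothesis-testing measures are not.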
Jaccard Similarity
Think of it this way: Of all the sentences that mention at least one of the two words, what fraction mentions both? This is a stricter overlap measure than Dice because it divides by the union (everything that mentions either word) rather than by the sum of the two sets.
Worked Example
Using a=15, b=10, c=5.
1. Count intersection and union:
   Intersection = 15, Union = 15 + 10 + 5 = 30
2. Apply the formula:
   \(J = \frac{15}{30}\) = 0.500
A Jaccard of 0.500 means half the sentences that mention either word actually mention both. Compare to Dice = 0.667 for the same data.
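Like Dice, Jaccard needs only the a, b, c cells; a one-line sketch:

```python
def jaccard(a, b, c):
    """Jaccard = |X ∩ Y| / |X ∪ Y| = a / (a + b + c)."""
    return a / (a + b + c)
```

Dice and Jaccard are monotonically related by J = D / (2 − D), so they always rank word pairs identically; only the absolute values differ (here 0.667 / (2 − 0.667) = 0.500).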
t-score
Think of it this way: How many "standard deviations" is the observed co-occurrence count above what we'd expect by chance? The t-score uses a simplified variance estimate (just the square root of the observed count) which makes it conservative — it favors frequent co-occurrences over rare but striking ones.
Worked Example
Using a=15, b=10, c=5, d=70, N=100.
1. Calculate expected count:
   \(E = \frac{25 \times 20}{100} = 5.0\)
2. Apply the formula:
   \(t = \frac{15 - 5}{\sqrt{15}} = \frac{10}{3.873}\) ≈ 2.582
A t-score of 2.582 exceeds the conventional threshold of 2.0, confirming a significant association.
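The two steps above can be sketched as:

```python
import math

def t_score(a, b, c, d):
    """(observed - expected) / sqrt(observed). The variance of the
    co-occurrence count is approximated by the count itself."""
    n = a + b + c + d
    expected = (a + b) * (a + c) / n   # E under independence
    return (a - expected) / math.sqrt(a)
```

Because the denominator is √a, a pair must co-occur fairly often before its t-score can get large, which is the source of the measure's bias toward frequent collocations.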
z-score
Think of it this way: The z-score is like the t-score's more precise cousin. Instead of using the observed count as the variance estimate, it uses the proper standard deviation under the assumption that co-occurrences follow a binomial (approximately normal) distribution. This gives a more accurate measure of "how many standard deviations from expected" the observation lies.
Worked Example
Using a=15, b=10, c=5, d=70, N=100.
1. Calculate expected count:
   \(\mu = E = \frac{25 \times 20}{100} = 5.0\)
2. Calculate standard deviation:
   \(\sigma = \sqrt{5.0 \times (1 - 5.0/100)} = \sqrt{5.0 \times 0.95} = \sqrt{4.75}\) ≈ 2.179
3. Apply the formula:
   \(z = \frac{15 - 5}{2.179}\) ≈ 4.589
A z-score of 4.589 corresponds to a p-value well below 0.001, far exceeding the significance threshold of z = 1.96 for a two-tailed test at the 0.05 level.
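The only change from the t-score is the denominator; a sketch:

```python
import math

def z_score(a, b, c, d):
    """(observed - expected) / sigma, where sigma comes from the
    binomial approximation: sqrt(E * (1 - E/N))."""
    n = a + b + c + d
    mu = (a + b) * (a + c) / n           # expected count E
    sigma = math.sqrt(mu * (1 - mu / n))
    return (a - mu) / sigma
```

Here σ ≈ 2.179 instead of √15 ≈ 3.873, so the z-score (4.589) comes out larger than the t-score (2.582) on the same data.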
Log-Dice
Think of it this way: LogDice takes the Dice coefficient and applies a logarithmic transformation that makes the values easier to interpret. The key advantage: logDice scores are stable across corpus sizes. A collocation with logDice = 10 in a 1-million-word corpus will get roughly logDice = 10 in a 100-million-word corpus too. The theoretical maximum is 14.
Worked Example
Using a=15, b=10, c=5 (so f(x)=25, f(y)=20).
1. Compute the Dice ratio:
   \(\frac{2 \times 15}{25 + 20} = \frac{30}{45} = 0.667\)
2. Apply the log-Dice formula:
   \(\text{logDice} = 14 + \log_2(0.667) = 14 + (-0.585)\) ≈ 13.415
A logDice of 13.415 (out of a maximum of 14) indicates an extremely strong association. Typical collocations score between 7 and 12.
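A sketch of the transformation, reusing the Dice formula from earlier in the chapter:

```python
import math

def log_dice(a, b, c):
    """logDice = 14 + log2(Dice). The maximum of 14 is reached when the
    two words always co-occur (Dice = 1, log2(1) = 0)."""
    dice = 2 * a / ((a + b) + (a + c))
    return 14 + math.log2(dice)
```

Each halving of the Dice coefficient costs exactly one logDice point, which is what makes scores comparable across corpora of different sizes.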
Fisher's Exact Test
Think of it this way: While χ² and LLR approximate the probability of seeing your data under independence, Fisher's exact test computes it exactly using the hypergeometric distribution. It's the gold standard for small samples — no approximation, no minimum cell count requirements.
Worked Example
Using a=15, b=10, c=5, d=70, N=100.
1. Identify the hypergeometric parameters:
   Row 1 total = 25, Row 2 total = 75, Column 1 total = 20, Observed = 15
2. Sum probabilities for a ≥ 15 (one-tailed):
   \(p = P(X=15) + P(X=16) + \ldots + P(X=20)\)
3. Evaluate (using log-factorials for numerical stability):
   \(p \approx 1.10 \times 10^{-7}\)
An astronomically small p-value confirms that this co-occurrence pattern is essentially impossible under independence.
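For counts of this size the tail sum can be computed directly with exact integer binomial coefficients; a sketch of the one-tailed test (for large tables you would switch to log-factorials, as noted above):

```python
from math import comb

def fisher_exact_one_tailed(a, b, c, d):
    """One-tailed Fisher's exact p-value: P(X >= a) under the
    hypergeometric distribution with the table margins held fixed."""
    n = a + b + c + d
    r1 = a + b                     # row 1 total
    c1 = a + c                     # column 1 total
    denom = comb(n, c1)
    # sum P(X = k) = C(r1, k) * C(n - r1, c1 - k) / C(n, c1)
    # over all tables at least as extreme as the observed one
    return sum(comb(r1, k) * comb(n - r1, c1 - k)
               for k in range(a, min(r1, c1) + 1)) / denom
```

As a cross-check, `scipy.stats.fisher_exact` with `alternative='greater'` computes the same one-tailed p-value.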
Interactive: Association Measure Comparison
Enter text below and pick two words. The demo builds a 2×2 contingency table from sentence-level co-occurrence and computes all nine association measures side-by-side.
Summary: When to Use Which
| Measure | Type | Range | Handles Sparse Data | Best For |
|---|---|---|---|---|
| MI | Information-theoretic | [0, +∞) | Moderate | Overall dependence between two variables |
| LLR (G²) | Hypothesis test | [0, +∞) | Good | Reliable significance testing, lexicography |
| χ² | Hypothesis test | [0, +∞) | Poor | Large corpora with ample expected counts |
| Dice | Set overlap | [0, 1] | Good | Intuitive overlap, equivalent to F1 |
| Jaccard | Set overlap | [0, 1] | Good | True metric distance (1-J), document similarity |
| t-score | Hypothesis test | (-∞, +∞) | Moderate | Finding frequent, reliable collocations |
| z-score | Hypothesis test | (-∞, +∞) | Moderate | Precise significance with normal approximation |
| logDice | Set overlap (log) | (-∞, 14] | Good | Cross-corpus comparison, lexicography |
| Fisher's exact | Exact test | [0, 1] (p-value) | Excellent | Small samples, gold-standard significance |
No single association measure is best for all purposes. Use LLR or Fisher's exact for significance testing, logDice or Dice for stable collocation ranking, MI for information-theoretic analysis, and t-score when you want frequency-biased results. When in doubt, compute several measures and look for agreement.