Chapter 02

Co-occurrence & Association Measures

9 formulas · Beginner

What you'll learn

  • How to build and interpret a 2×2 contingency table for word co-occurrence
  • Nine different ways to measure whether two words are associated
  • Why different measures disagree and when to trust each one
  • How to compare association measures side-by-side on real text
Prerequisites

This chapter builds on Chapter 1: PMI and Its Variants. Make sure you understand basic PMI before diving into these broader association measures.

Introduction

In Chapter 1, we measured word association using PMI and its variants. But PMI is just one member of a large family of association measures — statistical tools that answer the question: do these two words co-occur more than chance predicts?

Different measures approach this question from different angles. Some use information theory (MI, LLR), others use classical hypothesis testing (χ², t-score, z-score, Fisher's exact), and still others use set-theoretic overlap (Dice, Jaccard, logDice). Each has strengths: some handle sparse data better, some are more interpretable, and some are more stable across corpus sizes.

The foundation for most of these measures is the 2×2 contingency table, which counts how often two words appear together, separately, and not at all within some unit (a sentence, a document, or a co-occurrence window).
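As a concrete sketch, here is one way to build such a table in Python from sentence-level co-occurrence. The helper name, toy sentences, and whitespace tokenization are illustrative assumptions, not part of the chapter:

```python
# Minimal sketch: build a 2x2 contingency table from sentence-level co-occurrence.
def contingency_table(sentences, word_x, word_y):
    """Count sentences by presence/absence of two target words."""
    a = b = c = d = 0
    for sentence in sentences:
        tokens = set(sentence.lower().split())   # naive whitespace tokenization
        has_x, has_y = word_x in tokens, word_y in tokens
        if has_x and has_y:
            a += 1          # both words present
        elif has_x:
            b += 1          # only word_x
        elif has_y:
            c += 1          # only word_y
        else:
            d += 1          # neither word
    return a, b, c, d

sentences = [
    "the stock market rallied today",
    "stock prices fell sharply",
    "the market closed early",
    "investors stayed calm",
]
print(contingency_table(sentences, "stock", "market"))  # -> (1, 1, 1, 1)
```

The four counts (a, b, c, d) always sum to N, the total number of units, which is the invariant every measure below relies on.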

Mutual Information (MI)

Think of it this way: While PMI measures the association between a specific pair of words, Mutual Information averages this across all possible outcomes in the contingency table. It asks: how much does knowing whether word X is present tell you about whether word Y is present?

Mutual Information
$$\text{MI}(X;Y) = \sum_{x}\sum_{y} \textcolor{#059669}{p(x,y)} \cdot \log_2 \frac{\textcolor{#059669}{p(x,y)}}{\textcolor{#e11d48}{p(x)} \cdot \textcolor{#2563eb}{p(y)}}$$
p(x) Marginal probability of word x (present or absent)
p(y) Marginal probability of word y (present or absent)
p(x,y) Joint probability for each cell of the 2×2 table

Worked Example

From 100 sentences: 15 have both "stock" and "market," 10 have only "stock," 5 have only "market," 70 have neither (a=15, b=10, c=5, d=70, N=100).

  1. Calculate all joint probabilities:
    \(p(\text{both}) = 15/100 = 0.15\), \(p(\text{stock only}) = 0.10\), \(p(\text{market only}) = 0.05\), \(p(\text{neither}) = 0.70\)
  2. Calculate marginals:
    \(p(\text{stock}) = (15+10)/100 = 0.25\), \(p(\text{market}) = (15+5)/100 = 0.20\)
  3. Sum over all four cells:
    \(\text{MI} = 0.15 \log_2\!\frac{0.15}{0.25 \times 0.20} + 0.10 \log_2\!\frac{0.10}{0.25 \times 0.80} + \ldots\) ≈ 0.214 bits

An MI of 0.214 bits means knowing one word reduces your uncertainty about the other by about 0.214 bits.

MI vs. PMI: PMI is for a specific pair of outcomes; MI is the expected value of PMI over all outcomes. MI is always non-negative (MI ≥ 0), while PMI can be negative.
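The worked example can be verified with a short Python sketch; `mutual_information` is an illustrative helper, not a library function:

```python
import math

# Sum p(x,y) * log2(p(x,y) / (p(x)p(y))) over all four cells of the 2x2 table.
def mutual_information(a, b, c, d):
    n = a + b + c + d
    mi = 0.0
    # Each cell paired with its row and column marginal counts.
    for joint, px, py in [
        (a, a + b, a + c),   # x present, y present
        (b, a + b, b + d),   # x present, y absent
        (c, c + d, a + c),   # x absent,  y present
        (d, c + d, b + d),   # x absent,  y absent
    ]:
        if joint > 0:
            mi += (joint / n) * math.log2(joint * n / (px * py))
    return mi

print(round(mutual_information(15, 10, 5, 70), 3))  # -> 0.214
```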

Log-Likelihood Ratio (LLR)

Think of it this way: The Log-Likelihood Ratio asks "how surprised should we be by this contingency table?" It compares two models — one where the words are independent and one where they're not — and measures how much better the non-independent model fits the data.

Log-Likelihood Ratio (G²)
$$G^2 = 2 \sum \textcolor{#e11d48}{O} \cdot \ln\!\frac{\textcolor{#e11d48}{O}}{\textcolor{#2563eb}{E}}$$
O Observed frequency in each cell of the contingency table
E Expected frequency under independence: E = (row total × column total) / N

Worked Example

Using the same table: a=15, b=10, c=5, d=70, N=100.

  1. Calculate expected values under independence:
    \(E_a = 25 \times 20 / 100 = 5.0\), \(E_b = 25 \times 80 / 100 = 20.0\)
    \(E_c = 75 \times 20 / 100 = 15.0\), \(E_d = 75 \times 80 / 100 = 60.0\)
  2. Sum O × ln(O/E) for each cell:
    \(15 \ln(15/5) + 10 \ln(10/20) + 5 \ln(5/15) + 70 \ln(70/60)\)
  3. Multiply by 2:
    \(G^2 = 2 \times (16.48 - 6.93 - 5.49 + 10.79)\) ≈ 29.69

With 1 degree of freedom, G² = 29.69 far exceeds the critical value of 3.84 (p < 0.05), confirming a significant association.

LLR vs. χ²: Both test for independence, but LLR is more reliable when expected counts are small (sparse data). LLR follows an asymptotic χ² distribution with 1 df, so you can use the same critical value tables.
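The three steps above translate directly into code; this is a sketch with an assumed helper name:

```python
import math

# G^2 = 2 * sum O * ln(O/E), with E from row and column totals under independence.
def log_likelihood_ratio(a, b, c, d):
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    # Cells with O = 0 contribute nothing (the limit of O*ln(O/E) as O -> 0).
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

print(round(log_likelihood_ratio(15, 10, 5, 70), 2))  # -> 29.69
```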

Chi-Squared Test (χ²)

Think of it this way: For each cell in the table, ask: "How far off was the observed count from what we'd expect under independence?" Square those differences (so negative and positive deviations both count), divide by what was expected (to normalize), and add them all up.

Chi-Squared (simplified for 2×2 table)
$$\chi^2 = \frac{N \cdot (\textcolor{#e11d48}{a}\textcolor{#2563eb}{d} - \textcolor{#059669}{b}\textcolor{#d97706}{c})^2}{(\textcolor{#e11d48}{a}+\textcolor{#059669}{b})(\textcolor{#d97706}{c}+\textcolor{#2563eb}{d})(\textcolor{#e11d48}{a}+\textcolor{#d97706}{c})(\textcolor{#059669}{b}+\textcolor{#2563eb}{d})}$$
a Both words present (co-occurrence count)
b Word 1 present, word 2 absent
c Word 1 absent, word 2 present
d Neither word present

Worked Example

Continuing with a=15, b=10, c=5, d=70, N=100.

  1. Compute the cross-product difference:
    \(ad - bc = (15)(70) - (10)(5) = 1050 - 50 = 1000\)
  2. Compute the denominator:
    \((25)(75)(20)(80) = 3{,}000{,}000\)
  3. Plug into the formula:
    \(\chi^2 = \frac{100 \times 1000^2}{3{,}000{,}000} = \frac{100{,}000{,}000}{3{,}000{,}000}\) ≈ 33.33

χ² = 33.33 with 1 degree of freedom gives a p-value well below 0.001 — a highly significant association.

χ² vs. LLR: Both measure departure from independence, but χ² can be unreliable when any expected cell count falls below 5. For language data (which is typically very sparse), LLR is preferred. Use χ² when you have ample data.
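The simplified 2×2 formula is one line of Python; `chi_squared` is an assumed name for illustration:

```python
# chi^2 = N * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)), the shortcut for 2x2 tables.
def chi_squared(a, b, c, d):
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

print(round(chi_squared(15, 10, 5, 70), 2))  # -> 33.33
```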

Dice Coefficient

Think of it this way: Treat the sentences containing each word as a set. Dice measures the overlap: how much do these two sets share? It's the same formula as the F1 score in machine learning: the harmonic mean of precision and recall if you treat one word's sentences as "predictions" and the other's as "true labels."

Dice Coefficient
$$\text{Dice}(X, Y) = \frac{2\,\textcolor{#059669}{|X \cap Y|}}{\textcolor{#e11d48}{|X|} + \textcolor{#2563eb}{|Y|}} = \frac{2\textcolor{#059669}{a}}{2\textcolor{#059669}{a} + \textcolor{#e11d48}{b} + \textcolor{#2563eb}{c}}$$
|X ∩ Y| Number of sentences containing both words (= a)
|X| Number of sentences containing word X (= a + b)
|Y| Number of sentences containing word Y (= a + c)

Worked Example

Using a=15, b=10, c=5.

  1. Count overlapping and total sentences:
    Overlap = 15, |X| = 25, |Y| = 20
  2. Apply the formula:
    \(\text{Dice} = \frac{2 \times 15}{25 + 20} = \frac{30}{45}\) = 0.667

A Dice of 0.667 means two-thirds overlap — a strong association in set-overlap terms.

Dice vs. Jaccard: Dice is always ≥ Jaccard for the same data. Dice = 2J/(1+J), so a Jaccard of 0.5 equals a Dice of 0.667. Dice weights the overlap more generously.
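In code, Dice needs only the three counts that involve at least one of the words; the helper name is illustrative:

```python
# Dice = 2a / (2a + b + c): twice the overlap over the sum of both set sizes.
def dice(a, b, c):
    return 2 * a / (2 * a + b + c)

print(round(dice(15, 10, 5), 3))  # -> 0.667
```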

Jaccard Similarity

Think of it this way: Of all the sentences that mention at least one of the two words, what fraction mentions both? This is a stricter overlap measure than Dice because it divides by the union (everything that mentions either word) rather than by the sum of the two sets.

Jaccard Similarity Index
$$J(X, Y) = \frac{\textcolor{#059669}{|X \cap Y|}}{\textcolor{#d97706}{|X \cup Y|}} = \frac{\textcolor{#059669}{a}}{\textcolor{#059669}{a} + \textcolor{#e11d48}{b} + \textcolor{#2563eb}{c}}$$
|X ∩ Y| Intersection: sentences with both words (= a)
|X ∪ Y| Union: sentences with at least one word (= a + b + c)

Worked Example

Using a=15, b=10, c=5.

  1. Count intersection and union:
    Intersection = 15, Union = 15 + 10 + 5 = 30
  2. Apply the formula:
    \(J = \frac{15}{30}\) = 0.500

A Jaccard of 0.500 means half the sentences that mention either word actually mention both. Compare to Dice = 0.667 for the same data.

vs. Dice: Jaccard and Dice contain the same information — you can convert between them: J = D/(2-D). Jaccard is preferred when you want a true metric distance (1-J satisfies the triangle inequality), while Dice is more common in NLP collocation work.
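A quick sketch confirms both the Jaccard value and the D = 2J/(1+J) conversion from the note above:

```python
# Jaccard = a / (a + b + c): intersection over union.
def jaccard(a, b, c):
    return a / (a + b + c)

j = jaccard(15, 10, 5)   # 0.5 for the running example
d = 2 * j / (1 + j)      # convert to Dice via D = 2J / (1 + J)
print(j, round(d, 3))    # -> 0.5 0.667
```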

t-score

Think of it this way: How many "standard deviations" is the observed co-occurrence count above what we'd expect by chance? The t-score uses a simplified variance estimate (just the square root of the observed count) which makes it conservative — it favors frequent co-occurrences over rare but striking ones.

t-score for Association
$$t = \frac{\textcolor{#e11d48}{O} - \textcolor{#2563eb}{E}}{\sqrt{\textcolor{#e11d48}{O}}}$$
O Observed co-occurrence count (= a, the both-present cell)
E Expected count under independence: E = (a+b)(a+c)/N

Worked Example

Using a=15, b=10, c=5, d=70, N=100.

  1. Calculate expected count:
    \(E = \frac{25 \times 20}{100} = 5.0\)
  2. Apply the formula:
    \(t = \frac{15 - 5}{\sqrt{15}} = \frac{10}{3.873}\) ≈ 2.582

A t-score of 2.582 exceeds the conventional threshold of 2.0, confirming a significant association.

t-score vs. MI/PMI: The t-score has a strong frequency bias — it ranks common collocations (like "of the," "in the") high, while MI/PMI favor rare but strongly associated pairs ("ad hoc," "vice versa"). This makes the t-score a better choice when you want to find reliable collocations rather than surprising ones.
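The t-score computation is short enough to verify directly; `t_score` is an assumed helper name:

```python
import math

# t = (O - E) / sqrt(O), with E = row total * column total / N.
def t_score(a, b, c, d):
    n = a + b + c + d
    expected = (a + b) * (a + c) / n
    return (a - expected) / math.sqrt(a)

print(round(t_score(15, 10, 5, 70), 3))  # -> 2.582
```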

z-score

Think of it this way: The z-score is like the t-score's more precise cousin. Instead of using the observed count as the variance estimate, it uses the proper standard deviation under the assumption that co-occurrences follow a binomial (approximately normal) distribution. This gives a more accurate measure of "how many standard deviations from expected" the observation lies.

z-score for Association
$$z = \frac{\textcolor{#e11d48}{O} - \textcolor{#2563eb}{\mu}}{\textcolor{#059669}{\sigma}}$$
O Observed co-occurrence count (= a)
μ Expected count (mean under independence): μ = (a+b)(a+c)/N
σ Standard deviation: σ = √(E × (1 - E/N))

Worked Example

Using a=15, b=10, c=5, d=70, N=100.

  1. Calculate expected count:
    \(\mu = E = \frac{25 \times 20}{100} = 5.0\)
  2. Calculate standard deviation:
    \(\sigma = \sqrt{5.0 \times (1 - 5.0/100)} = \sqrt{5.0 \times 0.95} = \sqrt{4.75}\) ≈ 2.1794
  3. Apply the formula:
    \(z = \frac{15 - 5}{2.1794}\) ≈ 4.588

A z-score of 4.588 corresponds to a p-value well below 0.001, far exceeding the significance threshold of z = 1.96 for a two-tailed test at the 0.05 level.

z-score vs. t-score: For positively associated pairs (O > E), the z-score is larger than the t-score, because whenever O > E the t-score's denominator (√O) exceeds the z-score's (√(E(1-E/N))). In practice, they rank collocations very similarly; the t-score is more common in NLP because of its simplicity.
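The only change from the t-score sketch is the denominator, which now uses the binomial variance estimate:

```python
import math

# z = (O - mu) / sigma, with sigma = sqrt(E * (1 - E/N)) from the binomial model.
def z_score(a, b, c, d):
    n = a + b + c + d
    expected = (a + b) * (a + c) / n
    sigma = math.sqrt(expected * (1 - expected / n))
    return (a - expected) / sigma

print(round(z_score(15, 10, 5, 70), 3))  # -> 4.588
```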

Log-Dice

Think of it this way: LogDice takes the Dice coefficient and applies a logarithmic transformation that makes the values easier to interpret. The key advantage: logDice scores are stable across corpus sizes. A collocation with logDice = 10 in a 1-million-word corpus will get roughly logDice = 10 in a 100-million-word corpus too. The theoretical maximum is 14.

Log-Dice
$$\text{logDice} = 14 + \log_2 \frac{2 \cdot \textcolor{#059669}{f(x,y)}}{\textcolor{#e11d48}{f(x)} + \textcolor{#2563eb}{f(y)}}$$
f(x,y) Co-occurrence frequency (= a, the both-present count)
f(x) Frequency of word x (= a + b, sentences with word x)
f(y) Frequency of word y (= a + c, sentences with word y)

Worked Example

Using a=15, b=10, c=5 (so f(x)=25, f(y)=20).

  1. Compute the Dice ratio:
    \(\frac{2 \times 15}{25 + 20} = \frac{30}{45} = 0.667\)
  2. Apply the log-Dice formula:
    \(\text{logDice} = 14 + \log_2(0.667) = 14 + (-0.585)\) ≈ 13.415

A logDice of 13.415 (out of a maximum of 14) indicates an extremely strong association. Typical collocations score between 7 and 12.

vs. Dice: LogDice is just a log-transformed Dice coefficient with a constant offset. The offset of 14 keeps logDice positive for virtually all observed collocations: logDice > 0 whenever Dice > \(2^{-14}\), a threshold far below any Dice value you would see for a genuine collocation. The logarithmic scale makes differences between strong and weak collocations more perceptible.
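LogDice is a one-liner on top of the Dice ratio; the helper name is illustrative:

```python
import math

# logDice = 14 + log2(2a / (2a + b + c)); maximum value 14 when Dice = 1.
def log_dice(a, b, c):
    return 14 + math.log2(2 * a / (2 * a + b + c))

print(round(log_dice(15, 10, 5), 3))  # -> 13.415
```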

Fisher's Exact Test

Think of it this way: While χ² and LLR approximate the probability of seeing your data under independence, Fisher's exact test computes it exactly using the hypergeometric distribution. It's the gold standard for small samples — no approximation, no minimum cell count requirements.

Fisher's Exact Test (one-tailed p-value)
$$p = \sum_{k=a}^{\min(a+b,\, a+c)} \frac{\binom{\textcolor{#e11d48}{a+b}}{k}\,\binom{\textcolor{#2563eb}{c+d}}{a+c-k}}{\binom{\textcolor{#059669}{N}}{a+c}}$$
a+b Row 1 total (sentences with word 1)
c+d Row 2 total (sentences without word 1)
N Grand total of sentences

Worked Example

Using a=15, b=10, c=5, d=70, N=100.

  1. Identify the hypergeometric parameters:
    Row 1 total = 25, Row 2 total = 75, Column 1 total = 20, Observed = 15
  2. Sum probabilities for a ≥ 15 (one-tailed):
    \(p = P(X=15) + P(X=16) + \ldots + P(X=20)\)
  3. Evaluate (using log-factorials for numerical stability):
    \(p \approx 1.10 \times 10^{-7}\)

An astronomically small p-value confirms that this co-occurrence pattern is essentially impossible under independence.

vs. χ² / LLR: Fisher's test is exact (no approximation) and valid for any sample size, even when cell counts are very small. The downside: it's computationally more expensive for large tables, and it doesn't produce an "effect size" — only a p-value. Use Fisher's for significance testing, and MI or logDice for ranking associations by strength.
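For a table this small, the hypergeometric tail sum can be evaluated exactly with `math.comb` (Python 3.8+) instead of log-factorials; `fisher_exact_one_tailed` is an assumed helper name:

```python
from math import comb

# One-tailed Fisher p-value: sum the hypergeometric pmf for k = a .. min(row1, col1).
def fisher_exact_one_tailed(a, b, c, d):
    n = a + b + c + d
    row1, col1 = a + b, a + c
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)

print(f"{fisher_exact_one_tailed(15, 10, 5, 70):.2e}")  # -> 1.10e-07
```

For production use, `scipy.stats.fisher_exact` offers the same computation with one- and two-tailed options.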

Interactive: Association Measure Comparison

The interactive demo builds a 2×2 contingency table from sentence-level co-occurrence in your text and computes all nine association measures side-by-side for any pair of words you pick.
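The side-by-side comparison the demo performs can be sketched as one function over the chapter's running table; all names here are illustrative:

```python
import math
from math import comb

# Compute all nine association measures from the four cells of a 2x2 table.
def all_measures(a, b, c, d):
    n = a + b + c + d
    fx, fy = a + b, a + c                     # marginal frequencies of x and y
    e = fx * fy / n                           # expected co-occurrence count
    mi = sum((o / n) * math.log2(o * n / (px * py))
             for o, px, py in [(a, fx, fy), (b, fx, b + d),
                               (c, c + d, fy), (d, c + d, b + d)] if o > 0)
    expected = [fx * fy / n, fx * (b + d) / n,
                (c + d) * fy / n, (c + d) * (b + d) / n]
    llr = 2 * sum(o * math.log(o / ex)
                  for o, ex in zip([a, b, c, d], expected) if o > 0)
    chi2 = n * (a * d - b * c) ** 2 / (fx * (c + d) * fy * (b + d))
    dice = 2 * a / (2 * a + b + c)
    jac = a / (a + b + c)
    t = (a - e) / math.sqrt(a)
    z = (a - e) / math.sqrt(e * (1 - e / n))
    logdice = 14 + math.log2(dice)
    fisher = sum(comb(fx, k) * comb(n - fx, fy - k)
                 for k in range(a, min(fx, fy) + 1)) / comb(n, fy)
    return {"MI": mi, "LLR": llr, "chi2": chi2, "Dice": dice, "Jaccard": jac,
            "t": t, "z": z, "logDice": logdice, "Fisher p": fisher}

for name, value in all_measures(15, 10, 5, 70).items():
    print(f"{name:>8}: {value:.4g}")
```

Running this on the chapter's example reproduces every worked result above, which is a useful sanity check when porting the formulas to your own code.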

Summary: When to Use Which

| Measure | Type | Range | Handles sparse data | Best for |
|---|---|---|---|---|
| MI | Information-theoretic | [0, +∞) | Moderate | Overall dependence between two variables |
| LLR (G²) | Hypothesis test | [0, +∞) | Good | Reliable significance testing, lexicography |
| χ² | Hypothesis test | [0, +∞) | Poor | Large corpora with ample expected counts |
| Dice | Set overlap | [0, 1] | Good | Intuitive overlap, equivalent to F1 |
| Jaccard | Set overlap | [0, 1] | Good | True metric distance (1-J), document similarity |
| t-score | Hypothesis test | (-∞, +∞) | Moderate | Finding frequent, reliable collocations |
| z-score | Hypothesis test | (-∞, +∞) | Moderate | Precise significance with normal approximation |
| logDice | Set overlap (log) | (-∞, 14] | Good | Cross-corpus comparison, lexicography |
| Fisher's exact | Exact test | [0, 1] (p-value) | Excellent | Small samples, gold-standard significance |
Key Takeaway

No single association measure is best for all purposes. Use LLR or Fisher's exact for significance testing, logDice or Dice for stable collocation ranking, MI for information-theoretic analysis, and t-score when you want frequency-biased results. When in doubt, compute several measures and look for agreement.