Chapter 09

Lexical Diversity

7 formulas · Intermediate

What you'll learn

  • How to quantify vocabulary richness in a text using multiple measures
  • Why the simplest measure (TTR) is fatally flawed for comparing texts of different lengths
  • How advanced measures like MTLD and MATTR correct for length dependence
  • When to use each diversity measure and how to interpret the results

Introduction

How rich is a writer's vocabulary? Does a novelist use a wider range of words than a news reporter? Lexical diversity measures answer these questions by quantifying how varied the word choices are in a given text. A text that uses many different words has high lexical diversity; one that repeats the same words over and over has low diversity.

These measures are used across computational linguistics, language assessment, clinical linguistics (tracking language development or decline), and authorship analysis. The challenge is that the simplest measure — the Type-Token Ratio — is deeply sensitive to text length, which has driven decades of research into more robust alternatives.

In this chapter, we build from the naive TTR through progressively more sophisticated measures: from frequency-spectrum statistics (Yule's K, Simpson's D) to sequential algorithms (MTLD) and windowed approaches (MATTR), culminating in the curve-fitting logic of vocd-D.

Type-Token Ratio (TTR)

Think of it this way: Count how many different words (types) appear in a text, then divide by the total number of words (tokens). If every word is unique, TTR = 1. If the same word is repeated throughout, TTR approaches 0.

Type-Token Ratio
$$\text{TTR} = \frac{\textcolor{#e11d48}{|\text{types}|}}{\textcolor{#2563eb}{|\text{tokens}|}}$$
|types| Number of unique word forms (the vocabulary size of the text)
|tokens| Total number of word occurrences in the text

Worked Example

Text: "the cat sat on the mat"

  1. Count tokens: the, cat, sat, on, the, mat → 6 tokens
  2. Count types: {the, cat, sat, on, mat} → 5 types
  3. Compute TTR: \(\text{TTR} = 5 / 6\) = 0.833
Range: 0 (no diversity) to 1 (all unique)
Critical flaw: TTR systematically decreases as text length grows. A 100-word sample will almost always have a higher TTR than a 10,000-word sample from the same author, because longer texts inevitably repeat function words. This makes TTR unsuitable for comparing texts of different lengths.
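The computation and its length flaw are both easy to see in code. A minimal Python sketch (the function name is illustrative):

```python
def ttr(tokens):
    """Type-Token Ratio: unique word forms (types) / total words (tokens)."""
    return len(set(tokens)) / len(tokens)

tokens = "the cat sat on the mat".split()
print(ttr(tokens))  # 5 types / 6 tokens ≈ 0.833

# Length dependence: doubling a text by repeating it adds tokens but no
# new types, so TTR halves even though the vocabulary is unchanged.
print(ttr(tokens * 2))  # 5 types / 12 tokens ≈ 0.417
```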

Hapax Legomena Ratio

Think of it this way: How many words appear only once? These "once-words" (hapax legomena, from Greek) are a signal of vocabulary richness. A text with many unique single-use words likely draws from a larger vocabulary.

Hapax Legomena Ratio
$$\text{Hapax Ratio} = \frac{\textcolor{#059669}{V_1}}{\textcolor{#2563eb}{N}}$$
V₁ Number of words appearing exactly once (hapax legomena)
N Total number of tokens in the text

Worked Example

Text: "the cat sat on the mat" — Frequency: {the:2, cat:1, sat:1, on:1, mat:1}

  1. Words appearing exactly once: {cat, sat, on, mat} → \(V_1 = 4\)
  2. Total tokens: \(N = 6\)
  3. Hapax ratio: \(4 / 6\) = 0.667
vs. TTR: While TTR counts all unique words, the hapax ratio focuses specifically on words that appear only once. This gives a finer-grained view of vocabulary breadth, but like TTR, it is still somewhat sensitive to text length.
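The hapax ratio only needs a frequency table. A minimal sketch using Python's `collections.Counter`:

```python
from collections import Counter

def hapax_ratio(tokens):
    """Proportion of tokens whose word type occurs exactly once (V1 / N)."""
    freq = Counter(tokens)
    v1 = sum(1 for count in freq.values() if count == 1)
    return v1 / len(tokens)

tokens = "the cat sat on the mat".split()
print(hapax_ratio(tokens))  # V1 = 4 (cat, sat, on, mat), N = 6 → ≈ 0.667
```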

Yule's K

Think of it this way: If you pick two words at random from a text, what is the probability that they are the same word? Yule's K captures this idea using the full frequency spectrum — not just types and tokens, but how many words appear once, twice, three times, and so on.

Yule's K
$$K = 10^4 \times \frac{\textcolor{#d97706}{M_2} - \textcolor{#2563eb}{N}}{\textcolor{#2563eb}{N}^2}$$
Where M₂ is the second moment of the frequency spectrum
$$\textcolor{#d97706}{M_2} = \sum_{i=1}^{m} i^2 \cdot \textcolor{#7c3aed}{V_i}$$
N Total number of tokens
M₂ Sum of i² × Vᵢ across all frequency ranks
Vᵢ Number of word types that appear exactly i times (frequency spectrum)

Worked Example

Text: "the cat sat on the mat the" — Frequency: {the:3, cat:1, sat:1, on:1, mat:1}

  1. Build frequency spectrum:
    \(V_1 = 4\) (cat, sat, on, mat appear once), \(V_3 = 1\) (the appears 3 times)
  2. Compute \(M_2\):
    \(M_2 = 1^2 \times 4 + 3^2 \times 1 = 4 + 9 = 13\)
  3. \(N = 7\), so \(K = 10^4 \times \frac{13 - 7}{49} = 10^4 \times 0.1224\) = 1224.5

A lower K means more diverse vocabulary. Typical values: literary prose 80–130, news text 130–200.

Interpretation: Yule's K is relatively stable across text lengths because it uses the entire frequency spectrum, not just the type/token count. The 10⁴ multiplier keeps values in a readable range.
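The worked example above translates directly into code: build the frequency spectrum Vᵢ, compute the second moment M₂, and scale. A minimal sketch:

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10^4 * (M2 - N) / N^2, where M2 = sum of i^2 * V_i."""
    freq = Counter(tokens)                 # word -> count
    spectrum = Counter(freq.values())      # V_i: how many types occur exactly i times
    n = len(tokens)
    m2 = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (m2 - n) / (n * n)

tokens = "the cat sat on the mat the".split()
print(yules_k(tokens))  # M2 = 13, N = 7 → 10^4 * 6/49 ≈ 1224.5
```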

Simpson's Diversity Index

Think of it this way: If you pick two tokens at random, what is the probability they are different types? A text with many evenly distributed word types will have D close to 1. A text dominated by one repeated word will have D close to 0.

Simpson's Diversity Index
$$D = 1 - \sum_{i=1}^{V} \left(\frac{\textcolor{#e11d48}{n_i}}{\textcolor{#2563eb}{N}}\right)^2$$
nᵢ Count of the i-th word type in the text
N Total number of tokens
V Total number of types (vocabulary size)

Worked Example

Text: "the cat sat on the mat" — Frequency: {the:2, cat:1, sat:1, on:1, mat:1}

  1. Calculate squared proportions:
    \((2/6)^2 + (1/6)^2 + (1/6)^2 + (1/6)^2 + (1/6)^2\)
    \(= 0.1111 + 0.0278 + 0.0278 + 0.0278 + 0.0278 = 0.2222\)
  2. \(D = 1 - 0.2222\) = 0.778
Range: 0 (one word repeated) to 1 (all unique)
vs. Yule's K: Simpson's D and Yule's K are mathematically related — both measure the probability of drawing the same word twice. Simpson's D is the complement (higher = more diverse), while Yule's K is scaled differently (lower = more diverse). Ecologists know Simpson's D from biodiversity studies.
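Simpson's D is one line once the counts are in hand. A minimal sketch:

```python
from collections import Counter

def simpsons_d(tokens):
    """Probability that two randomly drawn tokens are different types."""
    n = len(tokens)
    return 1 - sum((count / n) ** 2 for count in Counter(tokens).values())

tokens = "the cat sat on the mat".split()
print(simpsons_d(tokens))  # 1 - (4 + 1 + 1 + 1 + 1)/36 ≈ 0.778
```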

MTLD (Measure of Textual Lexical Diversity)

Think of it this way: Walk through the text word by word, keeping a running TTR. Eventually, as you accumulate words, the running TTR drops below a threshold (typically 0.72). That point marks the end of one "segment." Reset and start a new segment. The average segment length is MTLD — longer segments mean richer vocabulary.

MTLD — Measure of Textual Lexical Diversity
$$\text{MTLD} = \frac{1}{2}\left(\frac{\textcolor{#2563eb}{N}}{\textcolor{#e11d48}{\text{segments}_\text{fwd}}} + \frac{\textcolor{#2563eb}{N}}{\textcolor{#e11d48}{\text{segments}_\text{bwd}}}\right)$$
N Total number of tokens in the text
segments Number of segments where TTR drops below threshold (0.72), counted forward and backward
0.72 Default TTR threshold — the point at which a segment "ends" and a new one begins

Worked Example (forward pass, simplified)

Consider a 200-token text. Walking forward:

  1. Start segment 1: after 55 tokens, running TTR drops below 0.72. Segment length = 55. Reset.
  2. Segment 2: after 48 tokens, TTR drops below 0.72 again. Segment length = 48. Reset.
  3. Segment 3: after 52 tokens, TTR drops below 0.72. Segment length = 52. Remaining tokens (45) form a partial segment.
  4. Forward pass: the partial segment contributes a fractional factor \((1 - \text{TTR}_\text{remaining}) / (1 - 0.72)\), so total forward segments = 3 + partial, and forward MTLD = 200 / segments.
  5. Repeat the same walk backward over the reversed text, then average the two passes: MTLD ≈ 51.7
Key advantage: MTLD is remarkably stable across text lengths. Whether you analyze 200 words or 2,000 words from the same source, MTLD produces similar values. This makes it the recommended measure for comparing texts of different sizes.
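The segment-walking procedure above can be sketched in Python (function names are illustrative; this follows the standard forward/backward averaging):

```python
def _mtld_pass(tokens, threshold=0.72):
    """One directional pass: count full segments plus a partial factor."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        types.add(tok)
        count += 1
        if len(types) / count < threshold:  # segment ends here
            factors += 1
            types, count = set(), 0
    if count > 0:  # leftover tokens form a partial segment
        factors += (1 - len(types) / count) / (1 - threshold)
    return len(tokens) / factors if factors > 0 else float(len(tokens))

def mtld(tokens, threshold=0.72):
    """Average of the forward and backward passes."""
    return (_mtld_pass(tokens, threshold) + _mtld_pass(tokens[::-1], threshold)) / 2

diverse = [f"word{i}" for i in range(100)]   # all unique
repetitive = ["the"] * 100                    # one word repeated
print(mtld(diverse), mtld(repetitive))        # diverse text scores far higher
```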

MATTR (Moving-Average TTR)

Think of it this way: Instead of computing one TTR over the whole text, slide a fixed-size window across the text and compute TTR within each window position. Then average all those local TTRs. Because each window is the same size, the length problem vanishes.

Moving-Average Type-Token Ratio
$$\text{MATTR}(W) = \frac{1}{\textcolor{#2563eb}{N} - \textcolor{#d97706}{W} + 1} \sum_{i=1}^{N-W+1} \text{TTR}(\text{tokens}[i \ldots i{+}\textcolor{#d97706}{W}{-}1])$$
N Total number of tokens in the text
W Window size (number of tokens per window, typically 25–100)

Worked Example

Text tokens: [the, cat, sat, on, the, mat, a, dog, ran, by] (N=10, W=5)

  1. Window 1 [the,cat,sat,on,the]: 4 types / 5 tokens = 0.80
    Window 2 [cat,sat,on,the,mat]: 5/5 = 1.00
    Window 3 [sat,on,the,mat,a]: 5/5 = 1.00
    Window 4 [on,the,mat,a,dog]: 5/5 = 1.00
    Window 5 [the,mat,a,dog,ran]: 5/5 = 1.00
    Window 6 [mat,a,dog,ran,by]: 5/5 = 1.00
  2. MATTR = (0.80 + 1.00 + 1.00 + 1.00 + 1.00 + 1.00) / 6 = 0.967
vs. TTR: Standard TTR uses one window (the entire text), so it's length-dependent. MATTR uses many overlapping windows of fixed size, averaging out local variation. The window size W is a parameter you must choose — 50 is a common default.
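The sliding window can be maintained incrementally so the whole pass runs in O(N) rather than recounting each window from scratch. A minimal sketch (window size 5 to match the example above; 25–100 is more typical in practice):

```python
from collections import Counter

def mattr(tokens, window=5):
    """Moving-Average TTR: mean TTR over all full windows of fixed size."""
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)  # fall back to plain TTR
    counts = Counter(tokens[:window])          # counts inside the current window
    total, n_windows = len(counts) / window, 1
    for i in range(window, len(tokens)):
        counts[tokens[i]] += 1                 # token entering the window
        out_tok = tokens[i - window]           # token leaving the window
        counts[out_tok] -= 1
        if counts[out_tok] == 0:
            del counts[out_tok]
        total += len(counts) / window
        n_windows += 1
    return total / n_windows

tokens = "the cat sat on the mat a dog ran by".split()
print(mattr(tokens, 5))  # (0.80 + 1.00 * 5) / 6 ≈ 0.967
```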

vocd-D

Think of it this way: TTR drops as sample size grows — but how fast it drops depends on the vocabulary richness. vocd-D captures this by measuring the shape of the TTR-vs-sample-size curve. Randomly draw samples of increasing sizes, measure TTR at each, and fit the resulting curve to a mathematical model. The parameter D of the best-fitting curve is your diversity score.

vocd-D — Expected TTR at sample size N
$$\text{TTR}(N) = \frac{\textcolor{#e11d48}{D}}{\textcolor{#2563eb}{N}} \left[ \sqrt{1 + 2\frac{\textcolor{#2563eb}{N}}{\textcolor{#e11d48}{D}}} - 1 \right]$$
D The diversity parameter — higher D means richer vocabulary
N Sample size (number of tokens in the random subsample)

How it works (simplified)

  1. Randomly sample 35 tokens from the text, compute TTR. Repeat 100 times, average. Record (35, avg_TTR).
  2. Repeat for sample sizes 36, 37, … 50. You now have empirical TTR values at 16 sample sizes.
  3. Fit the curve \(\text{TTR}(N) = \frac{D}{N}(\sqrt{1 + 2N/D} - 1)\) to the empirical data by finding the D that minimizes error.
    Result: D ≈ 72 (typical adult prose)
Typical values: D ranges from about 10 (very low diversity, e.g., young children) to 100+ (high diversity, e.g., academic writing). The full vocd algorithm involves curve fitting, which makes it more computationally expensive than simpler measures.
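The sampling-and-fitting loop can be sketched in Python. This is a simplification: real vocd implementations use proper least-squares optimization rather than the grid search below, and all names here are illustrative:

```python
import random
from statistics import mean

def model_ttr(n, d):
    """vocd model curve: expected TTR at sample size n for parameter d."""
    return (d / n) * ((1 + 2 * n / d) ** 0.5 - 1)

def vocd_d(tokens, sizes=range(35, 51), trials=100, seed=0):
    """Estimate D by fitting the model curve to averaged random-sample TTRs."""
    rng = random.Random(seed)
    empirical = []
    for n in sizes:
        ttrs = [len(set(rng.sample(tokens, n))) / n for _ in range(trials)]
        empirical.append((n, mean(ttrs)))
    # Grid search over D in [1.0, 200.0), step 0.1, minimizing squared error
    best_d, best_err = None, float("inf")
    for step in range(10, 2000):
        d = step / 10
        err = sum((model_ttr(n, t_emp) - t_emp) ** 0 * 0 +
                  (model_ttr(n, d) - t_emp) ** 2 for n, t_emp in empirical)
        if err < best_err:
            best_d, best_err = d, err
    return best_d

diverse = [f"word{i}" for i in range(60)]  # 60 unique tokens
repetitive = ["a", "b", "c"] * 20          # 3 types, 60 tokens
print(vocd_d(diverse), vocd_d(repetitive))  # richer text fits a higher D
```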

Interactive: Lexical Diversity Dashboard

Enter text below and compute all seven diversity measures at once. Try pasting text from different genres to compare their vocabulary richness.

Interactive: Running TTR Chart

This chart shows how cumulative TTR changes as you read through the text token by token — visually demonstrating why TTR is length-dependent. The MATTR line (adjustable window) shows a more stable alternative.

Summary: When to Use Which

| Measure | Range | Length-Sensitive? | Best For |
|---|---|---|---|
| TTR | (0, 1] | Yes (strongly) | Quick overview on same-length texts |
| Hapax Ratio | [0, 1] | Yes (moderately) | Vocabulary breadth, Zipf analysis |
| Yule's K | [0, +∞) | Low | Authorship studies, stylometry |
| Simpson's D | [0, 1] | Low | Biodiversity-style analysis, intuitive probability interpretation |
| MTLD | [0, +∞) | No (stable) | Comparing texts of different lengths, applied linguistics |
| MATTR | (0, 1] | No (stable) | Local diversity patterns, window-based analysis |
| vocd-D | [0, +∞) | No (stable) | Language development, clinical linguistics |
Key Takeaway

Simple measures like TTR are easy to compute but dangerously misleading when comparing texts of different lengths. For serious analysis, use MTLD, MATTR, or vocd-D — they were specifically designed to produce stable diversity estimates regardless of text size. When in doubt, report multiple measures and note the text lengths.