Lexical Diversity
What you'll learn
- How to quantify vocabulary richness in a text using multiple measures
- Why the simplest measure (TTR) is fatally flawed for comparing texts of different lengths
- How advanced measures like MTLD and MATTR correct for length dependence
- When to use each diversity measure and how to interpret the results
Introduction
How rich is a writer's vocabulary? Does a novelist use a wider range of words than a news reporter? Lexical diversity measures answer these questions by quantifying how varied the word choices are in a given text. A text that uses many different words has high lexical diversity; one that repeats the same words over and over has low diversity.
These measures are used across computational linguistics, language assessment, clinical linguistics (tracking language development or decline), and authorship analysis. The challenge is that the simplest measure — the Type-Token Ratio — is deeply sensitive to text length, which has driven decades of research into more robust alternatives.
In this chapter, we build from the naive TTR through progressively more sophisticated measures: from frequency-spectrum statistics (Yule's K, Simpson's D) to sequential algorithms (MTLD) and windowed approaches (MATTR), culminating in the curve-fitting logic of vocd-D.
Type-Token Ratio (TTR)
Think of it this way: Count how many different words (types) appear in a text, then divide by the total number of words (tokens). If every word is unique, TTR = 1. If the same word is repeated throughout, TTR approaches 0.
Worked Example
Text: "the cat sat on the mat"
- Count tokens: the, cat, sat, on, the, mat → 6 tokens
- Count types: {the, cat, sat, on, mat} → 5 types
- Compute TTR: \(\text{TTR} = 5 / 6 \approx 0.833\)
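The computation above is a one-liner in code. A minimal sketch in Python (the function name `ttr` is ours, not a library call):

```python
def ttr(tokens):
    """Type-Token Ratio: number of distinct words (types) / total words (tokens)."""
    return len(set(tokens)) / len(tokens)

print(round(ttr("the cat sat on the mat".split()), 3))  # 0.833
```

Note that `set(tokens)` collapses repeated words, which is exactly the type count.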
Hapax Legomena Ratio
Think of it this way: How many words appear only once? These "once-words" (hapax legomena, from Greek) are a signal of vocabulary richness. A text with many unique single-use words likely draws from a larger vocabulary.
Worked Example
Text: "the cat sat on the mat" — Frequency: {the:2, cat:1, sat:1, on:1, mat:1}
- Words appearing exactly once: {cat, sat, on, mat} → \(V_1 = 4\)
- Total tokens: \(N = 6\)
- Hapax ratio: \(V_1 / N = 4 / 6 \approx 0.667\)
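A minimal sketch of the same computation in Python (the function name `hapax_ratio` is ours), using a frequency table to find the once-words:

```python
from collections import Counter

def hapax_ratio(tokens):
    """V1 / N: number of types occurring exactly once, divided by total tokens."""
    freq = Counter(tokens)
    v1 = sum(1 for count in freq.values() if count == 1)
    return v1 / len(tokens)

print(round(hapax_ratio("the cat sat on the mat".split()), 3))  # 0.667
```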
Yule's K
Think of it this way: If you pick two words at random from a text, what is the probability that they are the same word? Yule's K captures this idea using the full frequency spectrum — not just types and tokens, but how many words appear once, twice, three times, and so on.
Worked Example
Text: "the cat sat on the mat the" — Frequency: {the:3, cat:1, sat:1, on:1, mat:1}
- Build the frequency spectrum: \(V_1 = 4\) (cat, sat, on, mat each appear once), \(V_3 = 1\) (the appears 3 times)
- Compute \(M_2 = \sum_i i^2 V_i = 1^2 \times 4 + 3^2 \times 1 = 4 + 9 = 13\)
- With \(N = 7\): \(K = 10^4 \times \frac{M_2 - N}{N^2} = 10^4 \times \frac{13 - 7}{49} \approx 1224.5\)
A lower K means more diverse vocabulary. Typical values: literary prose 80–130, news text 130–200.
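The frequency-spectrum computation translates directly to Python. A minimal sketch (the function name `yules_k` is ours):

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10^4 * (M2 - N) / N^2, where M2 = sum over i of i^2 * V_i
    and V_i is the number of types occurring exactly i times."""
    n = len(tokens)
    freq = Counter(tokens)             # type -> occurrence count
    spectrum = Counter(freq.values())  # occurrence count i -> V_i
    m2 = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (m2 - n) / (n * n)

print(round(yules_k("the cat sat on the mat the".split()), 1))  # 1224.5
```

Note that `Counter(freq.values())` builds the spectrum \(V_i\) in one step: it counts how many types share each occurrence count.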
Simpson's Diversity Index
Think of it this way: If you pick two tokens at random, what is the probability they are different types? A text with many evenly distributed word types will have D close to 1. A text dominated by one repeated word will have D close to 0.
Worked Example
Text: "the cat sat on the mat" — Frequency: {the:2, cat:1, sat:1, on:1, mat:1}
- Sum the squared type proportions: \((2/6)^2 + (1/6)^2 + (1/6)^2 + (1/6)^2 + (1/6)^2 = 0.1111 + 0.0278 + 0.0278 + 0.0278 + 0.0278 = 0.2222\)
- Compute \(D = 1 - 0.2222 = 0.778\)
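A minimal sketch of this index in Python (the function name `simpson_d` is ours; this is the with-replacement form matching the worked example, sometimes called the Gini-Simpson index):

```python
from collections import Counter

def simpson_d(tokens):
    """Gini-Simpson index: 1 minus the sum of squared type proportions."""
    n = len(tokens)
    return 1 - sum((count / n) ** 2 for count in Counter(tokens).values())

print(round(simpson_d("the cat sat on the mat".split()), 3))  # 0.778
```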
MTLD (Measure of Textual Lexical Diversity)
Think of it this way: Walk through the text word by word, keeping a running TTR. Eventually, as you accumulate words, the running TTR drops below a threshold (typically 0.72). That point marks the end of one "segment." Reset and start a new segment. The average segment length is MTLD — longer segments mean richer vocabulary.
Worked Example (forward pass, simplified)
Consider a 200-token text. Walking forward:
- Start segment 1: after 55 tokens, the running TTR drops below 0.72. Segment length = 55. Reset.
- Segment 2: after 48 more tokens, the running TTR drops below 0.72 again. Segment length = 48. Reset.
- Segment 3: after 52 more tokens, the running TTR drops below 0.72. Segment length = 52. The remaining 45 tokens form a partial segment.
- Forward pass: partial factor = \((1 - \text{TTR}_{\text{remaining}}) / (1 - 0.72)\), total factors = 3 + partial, and forward MTLD = 200 / (3 + partial).

Run the same procedure backward through the text (a backward pass) and average the two results: MTLD ≈ 51.7
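The walk-and-reset procedure can be sketched in Python as below. Function names are ours, and implementations differ in small details (exact threshold comparison, handling of texts whose TTR never drops), so treat this as an illustration of the algorithm rather than a reference implementation:

```python
def mtld_pass(tokens, threshold=0.72):
    """One directional pass: count factors (segments whose running TTR
    falls below the threshold), then divide token count by factor count."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count < threshold:  # segment complete: count it and reset
            factors += 1
            types, count = set(), 0
    if count > 0:  # leftover tokens contribute a partial factor
        factors += (1 - len(types) / count) / (1 - threshold)
    return len(tokens) / factors if factors else float("inf")

def mtld(tokens, threshold=0.72):
    """Average the forward and backward passes."""
    return (mtld_pass(tokens, threshold) + mtld_pass(tokens[::-1], threshold)) / 2
```

For example, a maximally repetitive text like `["a"] * 10` completes a factor every two tokens, giving an MTLD of 2.0, while richer texts produce much longer segments.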
MATTR (Moving-Average TTR)
Think of it this way: Instead of computing one TTR over the whole text, slide a fixed-size window across the text and compute TTR within each window position. Then average all those local TTRs. Because each window is the same size, the length problem vanishes.
Worked Example
Text tokens: [the, cat, sat, on, the, mat, a, dog, ran, by] (N=10, W=5)
- Window 1 [the, cat, sat, on, the]: 4 types / 5 tokens = 0.80
- Window 2 [cat, sat, on, the, mat]: 5/5 = 1.00
- Window 3 [sat, on, the, mat, a]: 5/5 = 1.00
- Window 4 [on, the, mat, a, dog]: 5/5 = 1.00
- Window 5 [the, mat, a, dog, ran]: 5/5 = 1.00
- Window 6 [mat, a, dog, ran, by]: 5/5 = 1.00
- MATTR = (0.80 + 1.00 + 1.00 + 1.00 + 1.00 + 1.00) / 6 ≈ 0.967
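The sliding-window average is straightforward to sketch in Python (the function name `mattr` is ours; texts shorter than the window fall back to plain TTR, a convention we chose for the sketch):

```python
def mattr(tokens, window=5):
    """Moving-Average TTR: mean TTR over every full window of fixed size."""
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)  # too short: plain TTR
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)

tokens = "the cat sat on the mat a dog ran by".split()
print(round(mattr(tokens, window=5), 3))  # 0.967
```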
vocd-D
Think of it this way: TTR drops as sample size grows — but how fast it drops depends on the vocabulary richness. vocd-D captures this by measuring the shape of the TTR-vs-sample-size curve. Randomly draw samples of increasing sizes, measure TTR at each, and fit the resulting curve to a mathematical model. The parameter D of the best-fitting curve is your diversity score.
How it works (simplified)
- Randomly sample 35 tokens from the text and compute TTR. Repeat 100 times and average. Record (35, avg_TTR).
- Repeat for sample sizes 36, 37, …, 50. You now have empirical TTR values at 16 sample sizes.
- Fit the curve \(\text{TTR}(N) = \frac{D}{N}\left(\sqrt{1 + 2N/D} - 1\right)\) to the empirical data by finding the \(D\) that minimizes the error. The best-fitting \(D\) is the diversity score; typical adult prose gives \(D \approx 72\).
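The three steps above can be sketched in Python. This is a rough illustration only: function names are ours, and the least-squares curve fit used by real vocd implementations is replaced here by a crude grid search over candidate values of D:

```python
import random

def model_ttr(n, d):
    """The vocd-D model curve: TTR(N) = (D/N) * (sqrt(1 + 2N/D) - 1)."""
    return (d / n) * ((1 + 2 * n / d) ** 0.5 - 1)

def vocd_d(tokens, trials=100, sizes=range(35, 51), seed=0):
    rng = random.Random(seed)
    # Steps 1-2: empirical mean TTR at each sample size, averaged over random draws.
    points = []
    for n in sizes:
        mean_ttr = sum(len(set(rng.sample(tokens, n))) / n
                       for _ in range(trials)) / trials
        points.append((n, mean_ttr))
    # Step 3: grid search for the D (here 1.0 to 200.0 in steps of 0.1)
    # minimizing squared error against the model curve.
    return min((d / 10 for d in range(10, 2001)),
               key=lambda d: sum((t - model_ttr(n, d)) ** 2 for n, t in points))
```

A text of 200 all-distinct tokens pushes the fitted D toward the top of the grid, while a two-word text like `["a", "b"] * 100` fits a very small D, matching the intuition that flatter TTR curves mean richer vocabulary.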
Interactive: Lexical Diversity Dashboard
Enter text below and compute all seven diversity measures at once. Try pasting text from different genres to compare their vocabulary richness.
Interactive: Running TTR Chart
This chart shows how cumulative TTR changes as you read through the text token by token — visually demonstrating why TTR is length-dependent. The MATTR line (adjustable window) shows a more stable alternative.
Summary: When to Use Which
| Measure | Range | Length-Sensitive? | Best For |
|---|---|---|---|
| TTR | (0, 1] | Yes (strongly) | Quick overview on same-length texts |
| Hapax Ratio | [0, 1] | Yes (moderately) | Vocabulary breadth, Zipf analysis |
| Yule's K | [0, +∞) | Low | Authorship studies, stylometry |
| Simpson's D | [0, 1] | Low | Biodiversity-style analysis, intuitive probability interpretation |
| MTLD | [0, +∞) | No (stable) | Comparing texts of different lengths, applied linguistics |
| MATTR | (0, 1] | No (stable) | Local diversity patterns, window-based analysis |
| vocd-D | [0, +∞) | No (stable) | Language development, clinical linguistics |
Simple measures like TTR are easy to compute but dangerously misleading when comparing texts of different lengths. For serious analysis, use MTLD, MATTR, or vocd-D — they were specifically designed to produce stable diversity estimates regardless of text size. When in doubt, report multiple measures and note the text lengths.