Chapter 03

TF-IDF Family

3 formulas · Beginner · Prerequisites: Ch01, Ch02

What you'll learn

  • How TF-IDF identifies the most important words in a document relative to a collection
  • Why common words like "the" get low scores and rare, discriminative words get high scores
  • How BM25 improves on TF-IDF with term saturation and length normalization
  • How TF-ICF adapts the same idea for text classification tasks

Introduction

Suppose you have thousands of documents and a user types a search query. Which documents are most relevant? Simply counting how often query words appear won't work — common words like "the" appear everywhere and would dominate the scores.

The TF-IDF family solves this by balancing two forces: how important a word is within a document (term frequency) versus how common it is across documents (inverse document frequency). Words that appear frequently in one document but rarely elsewhere are the best discriminators — and they get the highest scores.

This chapter covers three members of the family: classic TF-IDF, the industry-standard BM25 ranking function, and TF-ICF for classification. Together they power everything from web search engines to spam filters.

TF-IDF (Term Frequency – Inverse Document Frequency)

Think of it this way: A word's importance in a document is the product of two things: how often it appears in that document (TF) and how rare it is across all documents (IDF). The word "algorithm" appearing 5 times in a computer science paper is informative because "algorithm" doesn't appear in most documents. The word "the" appearing 50 times is meaningless because every document has it.

TF-IDF
$$\text{TF-IDF}(\textcolor{#e11d48}{t}, \textcolor{#2563eb}{d}, \textcolor{#059669}{D}) = \textcolor{#d97706}{\text{TF}(t, d)} \times \textcolor{#7c3aed}{\text{IDF}(t, D)}$$
Term Frequency (raw)
$$\textcolor{#d97706}{\text{TF}(t, d)} = \frac{\text{count}(\textcolor{#e11d48}{t},\, \textcolor{#2563eb}{d})}{|\textcolor{#2563eb}{d}|}$$
Inverse Document Frequency
$$\textcolor{#7c3aed}{\text{IDF}(t, D)} = \log_2 \frac{\textcolor{#059669}{N}}{\textcolor{#0891b2}{\text{df}(t)}}$$
t The term (word) we are scoring
d The specific document
N Total number of documents in the collection
TF(t,d) Term frequency — count of t in d, divided by document length
IDF(t) Inverse document frequency — log of total docs divided by docs containing t
df(t) Document frequency — number of documents that contain term t
TF Variants

There are several ways to compute term frequency:

  • Raw TF: \(\text{count}(t, d) / |d|\) — simple proportion (used above)
  • Log-normalized TF: \(1 + \log \text{count}(t, d)\) when the count is positive, else \(0\) — dampens the effect of high counts
  • Boolean TF: \(1\) if t appears in d, else \(0\) — ignores repetition entirely
  • Augmented TF: \(0.5 + 0.5 \cdot \frac{\text{count}(t, d)}{\max_{t'} \text{count}(t', d)}\) — normalized by the most frequent term in the document
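
The four variants above can be sketched in a few lines of Python (the function name and example sentence are mine; log-normalized TF uses the common \(1 + \log \text{count}\) form with the natural log):

```python
import math
from collections import Counter

def tf_variants(term, doc_tokens):
    """Compute four common TF variants for `term` in a tokenized document."""
    counts = Counter(doc_tokens)
    c = counts[term]
    return {
        "raw": c / len(doc_tokens),                      # simple proportion
        "log": 1 + math.log(c) if c > 0 else 0.0,        # dampened counts
        "boolean": 1.0 if c > 0 else 0.0,                # presence/absence only
        "augmented": 0.5 + 0.5 * c / max(counts.values()),  # vs. most frequent term
    }

doc = "the cat sat on the mat the cat".split()
print(tf_variants("the", doc))
# raw 0.375, log ≈ 2.10, boolean 1.0, augmented 1.0
```

Note how the variants diverge: raw TF rewards repetition linearly, log TF dampens it, and boolean TF ignores it entirely.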

Worked Example

Given a collection of N = 1000 documents, where document d has 200 words and contains "algorithm" 6 times, and "algorithm" appears in 50 of the 1000 documents:

  1. Calculate term frequency:
    \(\text{TF}(\text{algorithm}, d) = 6 / 200 = 0.03\)
  2. Calculate inverse document frequency:
    \(\text{IDF}(\text{algorithm}) = \log_2(1000 / 50) = \log_2(20) = 4.32\)
  3. Multiply to get TF-IDF:
    \(\text{TF-IDF} = 0.03 \times 4.32\) = 0.1296

Compare this with "the," which might appear in all 1000 documents: \(\text{IDF}(\text{the}) = \log_2(1000/1000) = 0\). No matter how often "the" appears, its TF-IDF is zero — exactly as we want.

Key insight: IDF acts as a filter. Words appearing in every document get IDF = 0 and are effectively ignored. Words appearing in only 1 document get the maximum IDF = log2(N). TF-IDF thus automatically surfaces rare, discriminative terms.
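
The worked example can be reproduced in a few lines of Python (the helper name `tf_idf` is mine; it uses raw TF and the base-2 IDF defined above):

```python
import math

def tf_idf(count_in_doc, doc_len, n_docs, doc_freq):
    """TF-IDF with raw TF and base-2 log IDF, as in this chapter."""
    tf = count_in_doc / doc_len           # local importance
    idf = math.log2(n_docs / doc_freq)    # global rarity
    return tf * idf

# "algorithm": 6 occurrences in a 200-word doc, found in 50 of 1000 docs
print(round(tf_idf(6, 200, 1000, 50), 3))   # → 0.13
# "the": present in every document, so IDF — and hence TF-IDF — is zero
print(tf_idf(50, 200, 1000, 1000))          # → 0.0
```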

BM25 (Okapi BM25)

Think of it this way: TF-IDF has a problem: if a word appears 100 times versus 10 times, TF-IDF gives it 10x the score. But does the 100th occurrence really add as much evidence as the 1st? BM25 says no. It applies diminishing returns to term frequency — after a few occurrences the score saturates. It also normalizes for document length, so long documents don't get unfair advantage just because they contain more words.

BM25 (Okapi BM25)
$$\text{BM25}(\textcolor{#2563eb}{D}, \textcolor{#e11d48}{Q}) = \sum_{i=1}^{|Q|} \textcolor{#7c3aed}{\text{IDF}(q_i)} \cdot \frac{\textcolor{#d97706}{f(q_i, D)} \cdot (\textcolor{#059669}{k_1} + 1)}{\textcolor{#d97706}{f(q_i, D)} + \textcolor{#059669}{k_1} \cdot \Big(1 - \textcolor{#0891b2}{b} + \textcolor{#0891b2}{b} \cdot \dfrac{|D|}{\text{avgdl}}\Big)}$$
Q The search query, consisting of terms q1, q2, ...
D The document being scored
f(qi, D) Raw count of query term qi in document D
k1 Term saturation parameter (typically 1.2). Controls how quickly TF saturates. Higher = slower saturation
b Length normalization parameter (typically 0.75). 0 = no normalization, 1 = full normalization
IDF(qi) Inverse document frequency of query term (same concept as TF-IDF)
Understanding k1 and b
  • k1 (saturation): When k1 = 0, BM25 becomes a boolean model (only presence/absence matters). As k1 grows, term frequency matters more. At k1 = 1.2 (default), a word appearing 3 times already scores about 70% of the maximum — additional occurrences help less and less.
  • b (length normalization): When b = 0, document length is ignored. When b = 1, the score is fully normalized by length relative to the average. At b = 0.75 (default), a document twice the average length has its term counts effectively discounted by a factor of 1.75, so it needs substantially more query term occurrences to achieve the same score.
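
Saturation is easy to see numerically. This sketch (function name mine) isolates the TF term of BM25 with the length factor set to 1 and prints each count's share of the asymptotic maximum, k1 + 1:

```python
def saturation(f, k1):
    """BM25 term-frequency saturation, with the length factor fixed at 1."""
    return f * (k1 + 1) / (f + k1)

k1 = 1.2
for f in (1, 2, 3, 5, 10, 100):
    frac = saturation(f, k1) / (k1 + 1)   # fraction of the asymptotic max
    print(f"f={f:>3}: {frac:.0%} of max")
```

With k1 = 1.2 the curve climbs fast and then flattens: one occurrence already earns about 45% of the maximum, three earn about 71%, and a hundred occurrences barely beat ten.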

Worked Example

Query "machine learning", Document D has 80 words, avgdl = 100, N = 1000 docs. "machine" appears 3 times (df = 200), "learning" appears 2 times (df = 150). Using k1 = 1.2, b = 0.75:

  1. Calculate IDF for each query term:
    \(\text{IDF}(\text{machine}) = \log_2\!\big(\frac{1000}{200}\big) = \log_2(5) = 2.32\)
    \(\text{IDF}(\text{learning}) = \log_2\!\big(\frac{1000}{150}\big) = \log_2(6.67) = 2.74\)
  2. Calculate the length normalization factor:
    \(1 - b + b \cdot \frac{|D|}{\text{avgdl}} = 1 - 0.75 + 0.75 \times \frac{80}{100} = 0.25 + 0.6 = 0.85\)
  3. Score for "machine" (f = 3):
    \(\frac{3 \times (1.2 + 1)}{3 + 1.2 \times 0.85} = \frac{3 \times 2.2}{3 + 1.02} = \frac{6.6}{4.02} = 1.642\)
    Contribution: \(2.32 \times 1.642 = 3.81\)
  4. Score for "learning" (f = 2):
    \(\frac{2 \times 2.2}{2 + 1.02} = \frac{4.4}{3.02} = 1.457\)
    Contribution: \(2.74 \times 1.457 = 3.99\)
  5. Total BM25 score:
    \(3.81 + 3.99\) = 7.80
vs. TF-IDF: BM25 improves on TF-IDF in two critical ways. First, term frequency has diminishing returns — going from 0 to 1 occurrence matters most, and each additional occurrence contributes less. Second, document length is explicitly normalized, so a 10-word document mentioning "learning" once is treated as more relevant than a 1000-word document mentioning it once. These two improvements make BM25 the dominant ranking function in modern search engines.
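
The full BM25 worked example can be checked with a short sketch (function name and argument names are mine; it uses the chapter's base-2 IDF, whereas production systems often substitute the probabilistic IDF):

```python
import math

def bm25_score(query_terms, doc_counts, doc_len, avgdl, n_docs, df,
               k1=1.2, b=0.75):
    """BM25 with base-2 log IDF, matching this chapter's conventions."""
    norm = 1 - b + b * doc_len / avgdl        # length normalization factor
    score = 0.0
    for t in query_terms:
        f = doc_counts.get(t, 0)              # raw count of t in the document
        idf = math.log2(n_docs / df[t])
        score += idf * f * (k1 + 1) / (f + k1 * norm)
    return score

# Query "machine learning", |D| = 80 words, avgdl = 100, N = 1000
score = bm25_score(
    ["machine", "learning"],
    {"machine": 3, "learning": 2},
    doc_len=80, avgdl=100, n_docs=1000,
    df={"machine": 200, "learning": 150},
)
print(round(score, 2))   # → 7.8, matching the worked example's 7.80
```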

TF-ICF (Term Frequency – Inverse Class Frequency)

Think of it this way: TF-IDF asks "how rare is this word across documents?" TF-ICF asks "how rare is this word across classes?" If you're classifying news articles into Sports, Politics, and Science, the word "touchdown" may appear in many sports documents but in zero politics documents. TF-ICF captures this: words confined to a single class get the highest ICF, making them powerful classification features.

TF-ICF
$$\text{TF-ICF}(\textcolor{#e11d48}{t}, \textcolor{#2563eb}{d}) = \textcolor{#d97706}{\text{TF}(t, d)} \times \textcolor{#7c3aed}{\text{ICF}(t)}$$
Inverse Class Frequency
$$\textcolor{#7c3aed}{\text{ICF}(t)} = \log_2 \frac{\textcolor{#059669}{C}}{\textcolor{#0891b2}{\text{cf}(t)}}$$
C Total number of classes (categories) in the classification task
cf(t) Class frequency — number of classes in which term t appears
TF(t,d) Term frequency of t in document d (same TF variants as TF-IDF apply)
ICF(t) Inverse class frequency — high when t appears in few classes

Worked Example

We have C = 4 classes: Sports, Politics, Science, Cooking. The word "photosynthesis" appears in 1 class (Science). In a particular science document of 150 words, it appears 3 times:

  1. Calculate term frequency:
    \(\text{TF}(\text{photosynthesis}, d) = 3 / 150 = 0.02\)
  2. Calculate inverse class frequency:
    \(\text{ICF}(\text{photosynthesis}) = \log_2(4 / 1) = \log_2(4) = 2.0\)
  3. Multiply:
    \(\text{TF-ICF} = 0.02 \times 2.0\) = 0.04

Now compare with "the," which appears in all 4 classes: \(\text{ICF}(\text{the}) = \log_2(4/4) = 0\). As with TF-IDF, common-across-all words are zeroed out, but here the discriminating unit is the class, not the document.

vs. TF-IDF: The only structural change is replacing document frequency with class frequency. This makes TF-ICF better suited for classification tasks where you care about which class a word distinguishes rather than which individual document. A word that appears in 500 documents but only in the "Science" class will have a low IDF but a high ICF.
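
Since the structure is identical to TF-IDF with df swapped for cf, the worked example fits in one function (name mine):

```python
import math

def tf_icf(count_in_doc, doc_len, n_classes, class_freq):
    """TF-ICF: raw TF times base-2 inverse class frequency."""
    return (count_in_doc / doc_len) * math.log2(n_classes / class_freq)

# "photosynthesis": 3 times in a 150-word doc, confined to 1 of 4 classes
print(tf_icf(3, 150, 4, 1))   # → 0.04
# "the": present in all 4 classes, so ICF = 0
print(tf_icf(30, 150, 4, 4))  # → 0.0
```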

Interactive: TF-IDF & BM25 Calculator

Explore how TF-IDF and BM25 score and rank documents. Enter your own text or use the defaults.

Demo 1: Multi-Document TF-IDF Calculator

Enter text in three documents and a search query. See how TF, IDF, and TF-IDF scores are computed for each query term across documents.

Demo 2: BM25 Ranker

Using the same document inputs, see how BM25 ranks documents by relevance. Adjust k1 and b to see their effect on ranking.

Summary: When to Use Which

| Formula | Key Idea | Strengths | Best For |
|---------|----------|-----------|----------|
| TF-IDF | TF × IDF | Simple, interpretable, fast | Feature extraction, document-term matrices, keyword extraction |
| BM25 | Saturating TF + length normalization | Diminishing returns, length normalization, tunable | Search ranking, document retrieval, first-pass retrieval |
| TF-ICF | TF × inverse class frequency | Class-level discrimination | Feature selection for classifiers, topic labeling |
Key Takeaway

The TF-IDF family shares a common principle: balance local importance (how often a term appears here) against global rarity (how unusual the term is overall). TF-IDF does this most simply, BM25 adds sophistication with saturation and length normalization, and TF-ICF shifts the rarity measure from documents to classes for classification tasks.