TF-IDF Family
What you'll learn
- How TF-IDF identifies the most important words in a document relative to a collection
- Why common words like "the" get low scores and rare, discriminative words get high scores
- How BM25 improves on TF-IDF with term saturation and length normalization
- How TF-ICF adapts the same idea for text classification tasks
Introduction
Suppose you have thousands of documents and a user types a search query. Which documents are most relevant? Simply counting how often query words appear won't work — common words like "the" appear everywhere and would dominate the scores.
The TF-IDF family solves this by balancing two forces: how important a word is within a document (term frequency) versus how common it is across documents (inverse document frequency). Words that appear frequently in one document but rarely elsewhere are the best discriminators — and they get the highest scores.
This chapter covers three members of the family: classic TF-IDF, the industry-standard BM25 ranking function, and TF-ICF for classification. Together they power everything from web search engines to spam filters.
TF-IDF (Term Frequency – Inverse Document Frequency)
Think of it this way: A word's importance in a document is the product of two things: how often it appears in that document (TF) and how rare it is across all documents (IDF). The word "algorithm" appearing 5 times in a computer science paper is informative because "algorithm" doesn't appear in most documents. The word "the" appearing 50 times is meaningless because every document has it.
There are several ways to compute term frequency:
- Raw TF: \(\text{count}(t, d) / |d|\) — simple proportion (used in the worked example below)
- Log-normalized TF: \(1 + \log(1 + \text{count}(t, d))\) — dampens the effect of high counts
- Boolean TF: \(1\) if t appears in d, else \(0\) — ignores repetition entirely
- Augmented TF: \(0.5 + 0.5 \cdot \frac{\text{count}(t, d)}{\max_{t'} \text{count}(t', d)}\) — normalized by the count of the document's most frequent term
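The four TF variants above can be written as one-liners. This is a minimal sketch; the function names and signatures are illustrative, not from any particular library.

```python
import math

def raw_tf(count, doc_len):
    # Raw TF: fraction of the document occupied by the term
    return count / doc_len

def log_tf(count):
    # Log-normalized TF: dampens the effect of high counts
    return 1 + math.log(1 + count)

def bool_tf(count):
    # Boolean TF: presence/absence only, repetition ignored
    return 1 if count > 0 else 0

def augmented_tf(count, max_count):
    # Augmented TF: normalized by the document's most frequent term,
    # so values always fall in [0.5, 1.0] for terms that appear
    return 0.5 + 0.5 * count / max_count
```

Note how `bool_tf` and `augmented_tf` bound the score regardless of repetition, while `raw_tf` grows linearly and `log_tf` grows logarithmically.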
Worked Example
Given a collection of N = 1000 documents, where document d has 200 words and contains "algorithm" 6 times, and "algorithm" appears in 50 of the 1000 documents:
1. Calculate term frequency:
   \(\text{TF}(\text{algorithm}, d) = 6 / 200 = 0.03\)
2. Calculate inverse document frequency:
   \(\text{IDF}(\text{algorithm}) = \log_2(1000 / 50) = \log_2(20) \approx 4.32\)
3. Multiply to get TF-IDF:
   \(\text{TF-IDF} = 0.03 \times 4.32 = 0.1296\)
Compare this with "the," which might appear in all 1000 documents: \(\text{IDF}(\text{the}) = \log_2(1000/1000) = 0\). No matter how often "the" appears, its TF-IDF is zero — exactly as we want.
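The three steps above can be sketched as a small Python function. The parameter names are illustrative; this is a minimal version using raw TF and the base-2 IDF from the worked example.

```python
import math

def tf_idf(term_count, doc_len, docs_with_term, n_docs):
    # TF: fraction of the document occupied by the term (raw TF)
    tf = term_count / doc_len
    # IDF: base-2 log of the inverse document fraction
    idf = math.log2(n_docs / docs_with_term)
    return tf * idf

# "algorithm": 6 of 200 words, in 50 of 1000 documents
print(round(tf_idf(6, 200, 50, 1000), 4))   # ≈ 0.1297 (0.1296 when IDF is rounded to 4.32)

# "the": appears in every document, so its IDF — and score — is zero
print(tf_idf(50, 200, 1000, 1000))          # 0.0
```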
BM25 (Okapi BM25)
Think of it this way: TF-IDF has a problem: if a word appears 100 times versus 10 times, TF-IDF gives it 10x the score. But does the 100th occurrence really add as much evidence as the 1st? BM25 says no. It applies diminishing returns to term frequency — after a few occurrences the score saturates. It also normalizes for document length, so long documents don't get unfair advantage just because they contain more words.
- k1 (saturation): When k1 = 0, BM25 becomes a boolean model (only presence/absence matters). As k1 grows, term frequency matters more. At k1 = 1.2 (default), a word appearing 3 times already reaches roughly three-quarters of the maximum score — additional occurrences help very little.
- b (length normalization): When b = 0, document length is ignored. When b = 1, the score is fully normalized by length relative to the average. At b = 0.75 (default), a document twice the average length needs roughly twice as many query term occurrences to achieve the same score.
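The saturation effect of k1 is easy to see numerically. A quick sketch, with length normalization omitted (equivalent to b = 0); the function name is illustrative:

```python
def bm25_tf(f, k1=1.2):
    # BM25's saturating term-frequency component; its ceiling is k1 + 1
    return f * (k1 + 1) / (f + k1)

# Fraction of the ceiling reached at each raw count:
for f in [1, 3, 10, 100]:
    print(f, round(bm25_tf(f) / (1.2 + 1), 3))
# 1 0.455
# 3 0.714
# 10 0.893
# 100 0.988
```

Going from 1 to 3 occurrences adds about 26 points; going from 10 to 100 adds fewer than 10. That is the diminishing-returns behavior TF-IDF lacks.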
Worked Example
Query "machine learning", Document D has 80 words, avgdl = 100, N = 1000 docs. "machine" appears 3 times (df = 200), "learning" appears 2 times (df = 150). Using k1 = 1.2, b = 0.75:
1. Calculate IDF for each query term:
   \(\text{IDF}(\text{machine}) = \log_2\!\big(\frac{1000}{200}\big) = \log_2(5) \approx 2.32\)
   \(\text{IDF}(\text{learning}) = \log_2\!\big(\frac{1000}{150}\big) = \log_2(6.67) \approx 2.74\)
2. Calculate the length normalization factor:
   \(1 - b + b \cdot \frac{|D|}{\text{avgdl}} = 1 - 0.75 + 0.75 \times \frac{80}{100} = 0.25 + 0.6 = 0.85\)
3. Score for "machine" (f = 3):
   \(\frac{3 \times (1.2 + 1)}{3 + 1.2 \times 0.85} = \frac{3 \times 2.2}{3 + 1.02} = \frac{6.6}{4.02} \approx 1.642\)
   Contribution: \(2.32 \times 1.642 \approx 3.81\)
4. Score for "learning" (f = 2):
   \(\frac{2 \times 2.2}{2 + 1.02} = \frac{4.4}{3.02} \approx 1.457\)
   Contribution: \(2.74 \times 1.457 \approx 3.99\)
5. Total BM25 score:
   \(3.81 + 3.99 = 7.80\)
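The whole calculation fits in a short function. This sketch uses the chapter's simplified \(\log_2(N/\text{df})\) IDF rather than the classic Okapi IDF, and its parameter names are illustrative:

```python
import math

def bm25_score(query_terms, term_counts, doc_len, avgdl, df, n_docs,
               k1=1.2, b=0.75):
    # Simplified BM25 with the chapter's log2(N/df) IDF
    norm = 1 - b + b * doc_len / avgdl          # length normalization factor
    score = 0.0
    for t in query_terms:
        f = term_counts.get(t, 0)               # term frequency in the document
        idf = math.log2(n_docs / df[t])
        score += idf * f * (k1 + 1) / (f + k1 * norm)
    return score

# Reproduce the worked example
score = bm25_score(["machine", "learning"],
                   {"machine": 3, "learning": 2},
                   doc_len=80, avgdl=100,
                   df={"machine": 200, "learning": 150},
                   n_docs=1000)
print(round(score, 2))   # 7.8
```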
TF-ICF (Term Frequency – Inverse Class Frequency)
Think of it this way: TF-IDF asks "how rare is this word across documents?" TF-ICF asks "how rare is this word across classes?" If you're classifying news articles into Sports, Politics, and Science, the word "touchdown" may appear in many sports documents but in zero politics documents. TF-ICF captures this: words confined to a single class get the highest ICF, making them powerful classification features.
Worked Example
We have C = 4 classes: Sports, Politics, Science, Cooking. The word "photosynthesis" appears in 1 class (Science). In a particular science document of 150 words, it appears 3 times:
1. Calculate term frequency:
   \(\text{TF}(\text{photosynthesis}, d) = 3 / 150 = 0.02\)
2. Calculate inverse class frequency:
   \(\text{ICF}(\text{photosynthesis}) = \log_2(4 / 1) = \log_2(4) = 2.0\)
3. Multiply:
   \(\text{TF-ICF} = 0.02 \times 2.0 = 0.04\)
Now compare with "the," which appears in all 4 classes: \(\text{ICF}(\text{the}) = \log_2(4/4) = 0\). As with TF-IDF, common-across-all words are zeroed out, but here the discriminating unit is the class, not the document.
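TF-ICF is structurally identical to TF-IDF with documents swapped for classes, which the code makes obvious. A minimal sketch with illustrative parameter names:

```python
import math

def tf_icf(term_count, doc_len, classes_with_term, n_classes):
    # TF within one document, ICF across the class labels
    tf = term_count / doc_len
    icf = math.log2(n_classes / classes_with_term)
    return tf * icf

print(tf_icf(3, 150, 1, 4))    # "photosynthesis": 0.02 * 2.0 = 0.04
print(tf_icf(10, 150, 4, 4))   # "the": present in all 4 classes, ICF = 0, score 0.0
```

Only the denominator's meaning changes: count classes containing the term instead of documents containing it.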
Interactive: TF-IDF & BM25 Calculator
Explore how TF-IDF and BM25 score and rank documents. Enter your own text or use the defaults.
Demo 1: Multi-Document TF-IDF Calculator
Enter text in three documents and a search query. See how TF, IDF, and TF-IDF scores are computed for each query term across documents.
Demo 2: BM25 Ranker
Using the same document inputs, see how BM25 ranks documents by relevance. Adjust k1 and b to see their effect on ranking.
Summary: When to Use Which
| Formula | Key Idea | Strengths | Best For |
|---|---|---|---|
| TF-IDF | TF × IDF | Simple, interpretable, fast | Feature extraction, document-term matrices, keyword extraction |
| BM25 | Saturating TF + length norm | Diminishing returns, length normalization, tunable | Search ranking, document retrieval, first-pass retrieval |
| TF-ICF | TF × Inverse Class Freq | Class-level discrimination | Feature selection for classifiers, topic labeling |
The TF-IDF family shares a common principle: balance local importance (how often a term appears here) against global rarity (how unusual the term is overall). TF-IDF does this most simply, BM25 adds sophistication with saturation and length normalization, and TF-ICF shifts the rarity measure from documents to classes for classification tasks.