IR Evaluation Metrics
What you'll learn
- How to measure search quality with precision, recall, and F1
- Why ranking order matters and how MAP captures it
- How NDCG handles graded relevance beyond binary judgments
- When to use each metric and the trade-offs between them
This chapter builds on Chapter 3 (TF-IDF / BM25), which covers how documents are retrieved and scored, and Chapter 5 (Similarity & Distance), which covers ranking documents by relevance scores.
Introduction
You've built a search engine. It retrieves documents, ranks them, and shows results to users. But how good is it? Information Retrieval (IR) evaluation metrics answer this fundamental question by comparing your system's ranked results against human-judged relevance.
The metrics in this chapter form the backbone of search evaluation, from web search engines to recommendation systems. They progress from simple set-based measures (precision, recall) to sophisticated rank-aware metrics (MAP, NDCG) that reward putting the best results at the top.
Precision@K
Think of it this way: You search for "machine learning" and the engine returns 10 results. You look at the top 5: three are relevant, two are junk. Your Precision@5 is 3/5 = 0.60. It answers: "Of the results I actually looked at, what fraction were useful?"
Worked Example
A search for "neural networks" returns these top-5 results (R = relevant, N = not relevant):
- Ranked list: [R, R, N, R, N]
- Relevant in top 5: positions 1, 2, 4 → 3 documents
- \(\text{P@5} = \frac{3}{5} = 0.60\)
- At different cutoffs: \(\text{P@1} = 1.0\), \(\text{P@2} = 1.0\), \(\text{P@3} = 0.67\), \(\text{P@4} = 0.75\)
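Here is a minimal sketch of the computation in Python. The function name and the convention of encoding the ranked list as booleans (True = relevant) are our own illustrative choices, not a standard API:

```python
def precision_at_k(ranked, k):
    """Fraction of the top-k results that are relevant.

    `ranked` is a list of booleans ordered by rank (True = relevant).
    By convention we divide by k even if fewer than k results exist.
    """
    return sum(ranked[:k]) / k

# The worked example above: [R, R, N, R, N]
ranked = [True, True, False, True, False]
print(precision_at_k(ranked, 5))  # 0.6
print(precision_at_k(ranked, 3))  # 0.666... (P@3 ≈ 0.67)
```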
Recall@K
Think of it this way: There are 10 relevant documents in the entire collection. Your top-5 results contain 3 of them. Your Recall@5 is 3/10 = 0.30. It answers: "Of everything I should have found, how much did I actually find?"
Worked Example
Using the same ranked list [R, R, N, R, N], and knowing there are 5 relevant documents total in the collection:
- Relevant found in top 5: 3 documents
- Total relevant in collection: 5 documents
- \(\text{R@5} = \frac{3}{5} = 0.60\)
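A companion sketch for recall, under the same boolean-list convention. The one extra input is the total number of relevant documents in the collection, which in practice comes from human relevance judgments:

```python
def recall_at_k(ranked, k, total_relevant):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(ranked[:k]) / total_relevant

# Same ranked list, 5 relevant documents in the whole collection
ranked = [True, True, False, True, False]
print(recall_at_k(ranked, 5, total_relevant=5))  # 0.6
```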
F1 Score
Think of it this way: A system that returns everything has perfect recall but terrible precision. A system that returns one perfect result has great precision but awful recall. F1 is the harmonic mean — it punishes extreme imbalance between precision and recall, only scoring high when both are good.
Worked Example
With P@5 = 0.60 and R@5 = 0.60 from our earlier examples:
- \(\text{F1} = 2 \times \frac{0.60 \times 0.60}{0.60 + 0.60} = 2 \times \frac{0.36}{1.20} = 0.60\)
- If instead P = 0.90 and R = 0.10: \(\text{F1} = 2 \times \frac{0.90 \times 0.10}{0.90 + 0.10} = 2 \times \frac{0.09}{1.00} = 0.18\)
Notice how the imbalanced case (0.90 vs 0.10) gets a harsh F1 of only 0.18 — the harmonic mean heavily penalizes the weak side.
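The penalty is easy to see in code. A small sketch (again our own helper, not a library function) reproduces both cases:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.60, 0.60))  # 0.6  -- balanced: F1 matches both inputs
print(f1_score(0.90, 0.10))  # 0.18 -- imbalanced: dragged toward the weak side
```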
MAP (Mean Average Precision)
Think of it this way: Precision@K treats all positions equally, but users care more about the top results. Average Precision (AP) computes precision at every rank where a relevant document appears, then averages those values. A system that puts all relevant docs at the top gets a higher AP than one that scatters them throughout the list. MAP averages AP across multiple queries.
Worked Example
Ranked list: [R, N, R, N, R] with 4 relevant documents total in collection:
- At rank 1 (relevant): P(1) = 1/1 = 1.00 → contributes 1.00
- At rank 2 (not relevant): skip
- At rank 3 (relevant): P(3) = 2/3 = 0.667 → contributes 0.667
- At rank 4 (not relevant): skip
- At rank 5 (relevant): P(5) = 3/5 = 0.60 → contributes 0.60
- \(\text{AP} = \frac{1}{4}(1.00 + 0.667 + 0.60 + 0) = 0.567\)
The 4th relevant document was not retrieved in the top 5, so it contributes 0 to AP. This penalizes the system for missing it.
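A sketch of AP and MAP under the same conventions as before (boolean ranked lists plus a known total of relevant documents per query); the function names are our own:

```python
def average_precision(ranked, total_relevant):
    """Mean of P@i over the ranks i where a relevant document appears.

    Dividing by total_relevant (not by the number of hits) means any
    relevant document that was never retrieved contributes 0 to AP.
    """
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(ranked, start=1):
        if rel:
            hits += 1
            precision_sum += hits / i
    return precision_sum / total_relevant

def mean_average_precision(queries):
    """MAP: average AP over (ranked_list, total_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)

# The worked example: [R, N, R, N, R] with 4 relevant docs in the collection
print(average_precision([True, False, True, False, True], 4))  # ≈ 0.567
```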
NDCG (Normalized Discounted Cumulative Gain)
Think of it this way: Not all relevant documents are equally good. A "highly relevant" document (score 3) is much more valuable than a "marginally relevant" one (score 1). NDCG handles this graded relevance and applies a logarithmic discount: the lower a document appears in the ranking, the less its relevance contributes. Finally, it normalizes by the ideal ranking to give a score between 0 and 1.
Worked Example
Ranked list with graded relevance: [3, 2, 0, 1, 2] at K=5:
- DCG calculation at each position, using gain \(2^{rel} - 1\) and discount \(\log_2(i+1)\):
  - i=1: (2³−1)/log₂(2) = 7/1.000 = 7.000
  - i=2: (2²−1)/log₂(3) = 3/1.585 = 1.893
  - i=3: (2⁰−1)/log₂(4) = 0/2.000 = 0.000
  - i=4: (2¹−1)/log₂(5) = 1/2.322 = 0.431
  - i=5: (2²−1)/log₂(6) = 3/2.585 = 1.161
- \(\text{DCG@5} = 7.00 + 1.893 + 0 + 0.431 + 1.161 = 10.484\)
- Ideal ranking: [3, 2, 2, 1, 0] → IDCG@5 = 7.00 + 1.893 + 1.500 + 0.431 + 0 = 10.823 (note the relevance-2 document moved up to rank 3, where its discount is only log₂(4) = 2)
- \(\text{NDCG@5} = 10.484 / 10.823 \approx 0.969\)
The ranking is nearly optimal: its only flaw is that the relevance-2 document at rank 5 should sit at rank 3, ahead of the irrelevant one, so NDCG is high (0.969) but not a perfect 1.0. If we had placed the irrelevant doc at rank 1 instead, NDCG would drop substantially.
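The following sketch implements this exponential-gain formulation of DCG; the helper names are our own, not a standard API:

```python
import math

def dcg_at_k(gains, k):
    """DCG with exponential gain (2^rel − 1) and log2(i + 1) rank discount."""
    return sum((2**rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """Normalize DCG by the DCG of the ideal (descending-relevance) ordering."""
    ideal = sorted(gains, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# The worked example: graded relevance [3, 2, 0, 1, 2] at K = 5
print(ndcg_at_k([3, 2, 0, 1, 2], 5))  # ≈ 0.969
```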
Interactive: Search Result Evaluator
Adjust the relevance scores for each search result and set the cutoff K. Every metric in this chapter is recomputed live, so you can see how relevance judgments and ranking order affect each score.
Summary: Choosing the Right Metric
| Metric | Type | Position-Aware? | Best For |
|---|---|---|---|
| Precision@K | Binary | No (only uses cutoff) | Quick quality check of top results |
| Recall@K | Binary | No (only uses cutoff) | High-stakes search (legal, medical, patent) |
| F1 Score | Binary | No | Balancing precision & recall trade-off |
| MAP | Binary | Yes (rewards top placement) | Overall system comparison (TREC standard) |
| NDCG | Graded | Yes (logarithmic discount) | Web search, learning-to-rank, multi-level relevance |
The progression from Precision to NDCG reflects increasing sophistication: first we ask "how many results are good?", then "are the good ones at the top?", and finally "how good are they, and are the best ones at the very top?" Choose the metric that matches your relevance model and what your users care about.