Chapter 11

IR Evaluation Metrics

What you'll learn

  • How to measure search quality with precision, recall, and F1
  • Why ranking order matters and how MAP captures it
  • How NDCG handles graded relevance beyond binary judgments
  • When to use each metric and the trade-offs between them
Prerequisites

This chapter builds on retrieval concepts from Chapter 3 (TF-IDF / BM25) for understanding how documents are retrieved, and Chapter 5 (Similarity & Distance) for ranking by relevance scores.

Introduction

You've built a search engine. It retrieves documents, ranks them, and shows results to users. But how good is it? Information Retrieval (IR) evaluation metrics answer this fundamental question by comparing your system's ranked results against human-judged relevance.

The metrics in this chapter form the backbone of search evaluation, from web search engines to recommendation systems. They progress from simple set-based measures (precision, recall) to sophisticated rank-aware metrics (MAP, NDCG) that reward putting the best results at the top.

Precision@K

Think of it this way: You search for "machine learning" and the engine returns 10 results. You look at the top 5: three are relevant, two are junk. Your Precision@5 is 3/5 = 0.60. It answers: "Of the results I actually looked at, what fraction were useful?"

Precision at K
$$\text{P@}\textcolor{#d97706}{K} = \frac{|\textcolor{#059669}{\text{relevant}} \cap \textcolor{#2563eb}{\text{retrieved}}_{\text{top } K}|}{\textcolor{#d97706}{K}}$$
relevant ∩ retrieved The set of documents in the top K that are actually relevant (true positives)
K The cutoff rank — how many results from the top we examine

Worked Example

A search for "neural networks" returns these top-5 results (R = relevant, N = not relevant):

  1. Ranked list: [R, R, N, R, N]
    Relevant in top 5: positions 1, 2, 4 → 3 documents
  2. \(\text{P@5} = \frac{3}{5}\) = 0.60
  3. At different cutoffs: \(\text{P@1} = 1.0\),   \(\text{P@2} = 1.0\),   \(\text{P@3} = 0.67\),   \(\text{P@4} = 0.75\)
Range: [0, 1]. A value of 1.0 means every result in the top K was relevant. Precision says nothing about relevant documents you missed — that's recall's job.
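The computation above can be sketched in a few lines of Python (the function and document names here are illustrative, not from any particular IR library):

```python
def precision_at_k(ranked, relevant, k):
    """P@K: fraction of the top-K retrieved documents that are relevant."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k

# The worked example: ranked list [R, R, N, R, N]
ranked = ["d1", "d2", "d3", "d4", "d5"]   # retrieval order
relevant = {"d1", "d2", "d4"}             # human relevance judgments
print(precision_at_k(ranked, relevant, 5))  # 0.6
print(precision_at_k(ranked, relevant, 1))  # 1.0
```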

Recall@K

Think of it this way: There are 10 relevant documents in the entire collection. Your top-5 results contain 3 of them. Your Recall@5 is 3/10 = 0.30. It answers: "Of everything I should have found, how much did I actually find?"

Recall at K
$$\text{R@}\textcolor{#d97706}{K} = \frac{|\textcolor{#059669}{\text{relevant}} \cap \textcolor{#2563eb}{\text{retrieved}}_{\text{top } K}|}{|\textcolor{#059669}{\text{relevant}}|}$$
|relevant| Total number of relevant documents in the entire collection (known from human judgments)

Worked Example

Using the same ranked list [R, R, N, R, N], and knowing there are 5 relevant documents total in the collection:

  1. Relevant found in top 5: 3 documents
  2. Total relevant in collection: 5 documents
  3. \(\text{R@5} = \frac{3}{5}\) = 0.60
Range: [0, 1]. A value of 1.0 means you found every relevant document. High recall is critical in legal discovery, medical search, and patent search where missing a relevant document has serious consequences.
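A matching recall@K sketch (same illustrative names as before; the only change from precision is the denominator):

```python
def recall_at_k(ranked, relevant, k):
    """R@K: fraction of all relevant documents that appear in the top K."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)

# Same ranked list [R, R, N, R, N], but 5 relevant docs exist collection-wide
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d2", "d4", "d6", "d7"}  # d6, d7 were never retrieved
print(recall_at_k(ranked, relevant, 5))  # 0.6
```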

F1 Score

Think of it this way: A system that returns everything has perfect recall but terrible precision. A system that returns one perfect result has great precision but awful recall. F1 is the harmonic mean — it punishes extreme imbalance between precision and recall, only scoring high when both are good.

F1 Score
$$\text{F1} = 2 \times \frac{\textcolor{#e11d48}{P} \times \textcolor{#2563eb}{R}}{\textcolor{#e11d48}{P} + \textcolor{#2563eb}{R}}$$
P Precision — fraction of retrieved documents that are relevant
R Recall — fraction of relevant documents that were retrieved

Worked Example

With P@5 = 0.60 and R@5 = 0.60 from our earlier examples:

  1. \(\text{F1} = 2 \times \frac{0.60 \times 0.60}{0.60 + 0.60} = 2 \times \frac{0.36}{1.20}\) = 0.60
  2. If instead P = 0.90 and R = 0.10:
    \(\text{F1} = 2 \times \frac{0.90 \times 0.10}{0.90 + 0.10} = 2 \times \frac{0.09}{1.00}\) = 0.18

Notice how the imbalanced case (0.90 vs. 0.10) gets a harsh F1 of only 0.18, even though the arithmetic mean would be 0.50 — the harmonic mean heavily penalizes the weaker side.

Range: [0, 1]. F1 = 1 only when both precision and recall are perfect. F1 = 0 when either is zero. The harmonic mean is always less than or equal to the arithmetic mean, making it a conservative measure.
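Both cases from the worked example can be checked with a small sketch (illustrative function name; the zero guard handles the case where either component is 0):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.60, 0.60), 2))  # balanced case   -> 0.6
print(round(f1_score(0.90, 0.10), 2))  # imbalanced case -> 0.18
```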

MAP (Mean Average Precision)

Think of it this way: Precision@K treats all positions equally, but users care more about the top results. Average Precision (AP) computes precision at every rank where a relevant document appears, then averages those values. A system that puts all relevant docs at the top gets a higher AP than one that scatters them throughout the list. MAP averages AP across multiple queries.

Average Precision (per query)
$$\text{AP} = \frac{1}{|\textcolor{#059669}{R}|} \sum_{k=1}^{n} \textcolor{#e11d48}{P(k)} \times \textcolor{#2563eb}{\text{rel}(k)}$$
Mean Average Precision (across queries)
$$\text{MAP} = \frac{1}{\textcolor{#d97706}{Q}} \sum_{q=1}^{Q} \text{AP}(q)$$
|R| Total number of relevant documents for this query
P(k) Precision computed at cutoff rank k
rel(k) Binary indicator: 1 if document at rank k is relevant, 0 otherwise
n Number of retrieved documents in the ranked list
Q Number of queries (for MAP, which averages across queries)

Worked Example

Ranked list: [R, N, R, N, R] with 4 relevant documents total in collection:

  1. At rank 1 (relevant): P(1) = 1/1 = 1.00 → contributes 1.00
    At rank 2 (not relevant): skip
    At rank 3 (relevant): P(3) = 2/3 = 0.667 → contributes 0.667
    At rank 4 (not relevant): skip
    At rank 5 (relevant): P(5) = 3/5 = 0.60 → contributes 0.60
  2. \(\text{AP} = \frac{1}{4}(1.00 + 0.667 + 0.60 + 0)\) = 0.567

The 4th relevant document was not retrieved in the top 5, so it contributes 0 to AP. This penalizes the system for missing it.

vs. Precision@K: MAP is rank-sensitive — moving a relevant document from position 5 to position 1 increases MAP even if Precision@5 stays the same. MAP also accounts for relevant documents not in the result list (they contribute 0).
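The AP computation is a single pass over the ranked list; a sketch with illustrative names (MAP would simply average this value over a set of queries):

```python
def average_precision(ranked, relevant):
    """AP: sum of P(k) at each rank k holding a relevant document,
    divided by the total number of relevant documents."""
    hits, precision_sum = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k   # P(k) at this relevant hit
    return precision_sum / len(relevant)

# Worked example: [R, N, R, N, R] with 4 relevant docs in the collection
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d5", "d9"}  # d9 was never retrieved: contributes 0
print(round(average_precision(ranked, relevant), 3))  # 0.567
```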

NDCG (Normalized Discounted Cumulative Gain)

Think of it this way: Not all relevant documents are equally good. A "highly relevant" document (score 3) is much more valuable than a "marginally relevant" one (score 1). NDCG handles this graded relevance and applies a logarithmic discount: the lower a document appears in the ranking, the less its relevance contributes. Finally, it normalizes by the ideal ranking to give a score between 0 and 1.

Discounted Cumulative Gain at K
$$\text{DCG@}\textcolor{#d97706}{K} = \sum_{i=1}^{K} \frac{2^{\textcolor{#059669}{\text{rel}(i)}} - 1}{\log_2(\textcolor{#2563eb}{i} + 1)}$$
Normalized DCG
$$\text{NDCG@}\textcolor{#d97706}{K} = \frac{\textcolor{#e11d48}{\text{DCG@}K}}{\textcolor{#059669}{\text{IDCG@}K}}$$
rel(i) Graded relevance score of the document at rank i (e.g., 0, 1, 2, or 3)
log₂(i+1) Position discount: results further down the list contribute less
IDCG@K Ideal DCG — the DCG of the perfect ranking (sort by relevance descending)

Worked Example

Ranked list with graded relevance: [3, 2, 0, 1, 2] at K=5:

  1. DCG calculation at each position:
      i=1: (2³−1)/log₂(2) = 7/1.00 = 7.00
      i=2: (2²−1)/log₂(3) = 3/1.585 = 1.893
      i=3: (2⁰−1)/log₂(4) = 0/2.00 = 0.00
      i=4: (2¹−1)/log₂(5) = 1/2.322 = 0.431
      i=5: (2²−1)/log₂(6) = 3/2.585 = 1.161
  2. \(\text{DCG@5} = 7.00 + 1.893 + 0 + 0.431 + 1.161\) = 10.484
  3. Ideal ranking: [3, 2, 2, 1, 0] → IDCG@5 = 7.00 + 1.893 + 1.50 + 0.431 + 0 = 10.823
  4. \(\text{NDCG@5} = 10.484 / 10.823\) ≈ 0.969

In this case the ranking was already nearly optimal (only the irrelevant doc at rank 3 deviates from the ideal ordering), leading to a very high NDCG of 0.969. If we had placed the irrelevant doc at rank 1 instead, NDCG would drop substantially.

vs. MAP: MAP uses binary relevance (relevant or not). NDCG handles graded relevance (0, 1, 2, 3, ...). The logarithmic discount in NDCG means errors at the top are punished far more than errors at the bottom.
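Both formulas can be sketched directly in Python (illustrative function names; this uses the exponential gain \(2^{\text{rel}} - 1\) shown above, and note that the ideal ordering [3, 2, 2, 1, 0] gives IDCG@5 ≈ 10.823, so NDCG@5 ≈ 0.969):

```python
import math

def dcg_at_k(rels, k):
    """DCG@K with exponential gain 2^rel - 1 and log2 position discount."""
    return sum((2**rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    idcg = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / idcg if idcg > 0 else 0.0

# Worked example: graded relevance [3, 2, 0, 1, 2]
print(round(dcg_at_k([3, 2, 0, 1, 2], 5), 3))   # 10.484
print(round(ndcg_at_k([3, 2, 0, 1, 2], 5), 3))  # 0.969
```

A perfectly ordered list, e.g. `[3, 2, 2, 1, 0]`, scores NDCG = 1.0 by construction, since its DCG equals its own IDCG.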

Summary: Choosing the Right Metric

| Metric | Type | Position-Aware? | Best For |
| --- | --- | --- | --- |
| Precision@K | Binary | No (only uses cutoff) | Quick quality check of top results |
| Recall@K | Binary | No (only uses cutoff) | High-stakes search (legal, medical, patent) |
| F1 Score | Binary | No | Balancing precision & recall trade-off |
| MAP | Binary | Yes (rewards top placement) | Overall system comparison (TREC standard) |
| NDCG | Graded | Yes (logarithmic discount) | Web search, learning-to-rank, multi-level relevance |
Key Takeaway

The progression from Precision to NDCG reflects increasing sophistication: first we ask "how many results are good?", then "are the good ones at the top?", and finally "how good are they, and are the best ones at the very top?" Choose the metric that matches your relevance model and what your users care about.