Chapter 05

Similarity & Distance


What you'll learn

  • How cosine similarity measures the angle between vectors, ignoring magnitude
  • When Euclidean distance is appropriate and when it misleads
  • Why Manhattan distance is more robust in high dimensions
  • How Minkowski distance generalizes both Euclidean and Manhattan
Prerequisites: Chapter 03 (TF-IDF Vectors) — you should be comfortable with the idea of representing documents as vectors in a high-dimensional word space.

Introduction

Once you represent text as vectors (term frequencies, TF-IDF weights, or embeddings), the next question is: how similar are two vectors? This chapter covers the four most important ways to answer that question.

The core insight is that "similarity" and "distance" are two sides of the same coin. High similarity means low distance, and vice versa. But the choice of metric matters enormously: cosine similarity ignores how long the vectors are (great for comparing documents of different lengths), while Euclidean distance cares about magnitude (better for comparing points in a fixed space).

We will build from the most specific metrics up to the general case: Minkowski distance unifies Euclidean and Manhattan into a single family controlled by one parameter, p.

Cosine Similarity

Think of it this way: Imagine two arrows pointing out from the origin. Cosine similarity measures the angle between them, ignoring how long each arrow is. Two documents about the same topic will point in roughly the same direction even if one is a tweet and the other is a novel — they differ in magnitude (word counts), not direction (word proportions).

Cosine Similarity
$$\cos(\theta) = \frac{\textcolor{#059669}{\mathbf{A} \cdot \mathbf{B}}}{\textcolor{#e11d48}{\|\mathbf{A}\|} \times \textcolor{#2563eb}{\|\mathbf{B}\|}} = \frac{\textcolor{#059669}{\sum_{i=1}^{n} a_i b_i}}{\textcolor{#e11d48}{\sqrt{\sum_{i=1}^{n} a_i^2}} \times \textcolor{#2563eb}{\sqrt{\sum_{i=1}^{n} b_i^2}}}$$
A · B Dot product — sum of element-wise products. Measures how much the vectors overlap.
||A|| Magnitude (L2 norm) of vector A — its "length" in n-dimensional space.
||B|| Magnitude (L2 norm) of vector B.

Worked Example

Given two TF vectors: A = (3, 2, 0, 1) and B = (1, 0, 2, 1) over vocabulary {cat, dog, fish, the}:

  1. Dot product:
    \(\mathbf{A} \cdot \mathbf{B} = (3)(1) + (2)(0) + (0)(2) + (1)(1) = 3 + 0 + 0 + 1 = 4\)
  2. Magnitudes:
    \(\|\mathbf{A}\| = \sqrt{9 + 4 + 0 + 1} = \sqrt{14} \approx 3.742\)
    \(\|\mathbf{B}\| = \sqrt{1 + 0 + 4 + 1} = \sqrt{6} \approx 2.449\)
  3. Cosine similarity:
    \(\cos(\theta) = \frac{4}{3.742 \times 2.449} = \frac{4}{9.165} \approx 0.4364\)

A cosine of 0.44 indicates moderate similarity — the documents share some vocabulary but differ significantly.

Range: For non-negative vectors (like TF or TF-IDF), cosine similarity ranges from 0 (orthogonal, no overlap) to 1 (identical direction). In general, cosine ranges from -1 (opposite) to +1 (identical).
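The worked example above can be reproduced with a short Python sketch using only the standard library (the function name is my own choice):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (ignores magnitude)."""
    dot = sum(x * y for x, y in zip(a, b))         # A · B
    norm_a = math.sqrt(sum(x * x for x in a))      # ||A||
    norm_b = math.sqrt(sum(x * x for x in b))      # ||B||
    return dot / (norm_a * norm_b)

# TF vectors over the vocabulary {cat, dog, fish, the}
A = (3, 2, 0, 1)
B = (1, 0, 2, 1)
print(round(cosine_similarity(A, B), 4))  # → 0.4364
```

Note that scaling either vector leaves the result unchanged: `cosine_similarity((1, 1), (3, 3))` is exactly 1.0, which is the "tweet vs. novel" point from the intuition above.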

Euclidean Distance

Think of it this way: This is the "as the crow flies" distance — the length of a straight line between two points. In 2D, it is the Pythagorean theorem. In higher dimensions, it is the same idea extended. Unlike cosine, magnitude matters: a long document far from the origin will be "far" from a short document near the origin, even if they point in the same direction.

Euclidean Distance (L2)
$$d(\textcolor{#e11d48}{\mathbf{A}}, \textcolor{#2563eb}{\mathbf{B}}) = \sqrt{\sum_{i=1}^{n} (\textcolor{#e11d48}{a_i} - \textcolor{#2563eb}{b_i})^2}$$
ai The i-th component of vector A.
bi The i-th component of vector B.
n Dimensionality of the vectors (vocabulary size for text).

Worked Example

Using the same vectors: A = (3, 2, 0, 1) and B = (1, 0, 2, 1):

  1. Element-wise differences squared:
    \((3-1)^2 + (2-0)^2 + (0-2)^2 + (1-1)^2 = 4 + 4 + 4 + 0 = 12\)
  2. Take the square root:
    \(d = \sqrt{12} \approx 3.464\)

The straight-line distance between these two vectors is 3.464 units.

vs. Cosine: Euclidean distance is sensitive to vector magnitudes. Two documents with identical word proportions but different lengths have zero cosine distance yet nonzero Euclidean distance. Use Euclidean when magnitude carries meaningful information, or when vectors have already been normalized to unit length (e.g., normalized embeddings, low-dimensional projections).
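A minimal sketch of the same computation, including a check of the magnitude-sensitivity point above (naming is mine):

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(round(euclidean_distance((3, 2, 0, 1), (1, 0, 2, 1)), 3))  # → 3.464

# Same direction, different magnitudes: cosine similarity is 1.0,
# but the Euclidean distance is nonzero.
print(euclidean_distance((1, 1), (3, 3)))  # → positive, not 0
```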

Manhattan Distance

Think of it this way: Imagine navigating a city grid — you can only walk along streets (horizontal or vertical), never diagonally. Manhattan distance is the total number of blocks you walk. It sums absolute differences instead of squaring them, making it less sensitive to outlier dimensions.

Manhattan Distance (L1)
$$d(\textcolor{#e11d48}{\mathbf{A}}, \textcolor{#2563eb}{\mathbf{B}}) = \sum_{i=1}^{n} |\textcolor{#e11d48}{a_i} - \textcolor{#2563eb}{b_i}|$$
ai The i-th component of vector A.
bi The i-th component of vector B.
|·| Absolute value — no squaring, so large differences in one dimension are not amplified.

Worked Example

Using A = (3, 2, 0, 1) and B = (1, 0, 2, 1):

  1. Absolute differences:
    \(|3-1| + |2-0| + |0-2| + |1-1| = 2 + 2 + 2 + 0 = 6\)

The city-block distance is 6 — notably larger than the Euclidean distance of 3.464, because Manhattan cannot take diagonal shortcuts.

vs. Euclidean: Manhattan ≥ Euclidean, always. They are equal only when the vectors differ in at most one dimension. Manhattan is more robust in high dimensions because it does not square differences — a single large difference does not dominate.
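The L1 computation is a one-liner; this sketch also confirms the "equal in at most one dimension" claim (function name is mine):

```python
def manhattan_distance(a, b):
    """City-block (L1) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

A = (3, 2, 0, 1)
B = (1, 0, 2, 1)
print(manhattan_distance(A, B))  # → 6

# Differences in a single dimension: L1 equals L2 (both are 3 here).
print(manhattan_distance((0, 5), (0, 2)))  # → 3
```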

Minkowski Distance

Think of it this way: Minkowski distance is the general family that contains both Manhattan and Euclidean as special cases. A single parameter p controls the shape: p=1 gives Manhattan, p=2 gives Euclidean, and as p approaches infinity, only the single largest difference matters (Chebyshev distance). It is the "dial" that lets you tune how much large differences are penalized.

Minkowski Distance (Lp)
$$d_{\textcolor{#d97706}{p}}(\textcolor{#e11d48}{\mathbf{A}}, \textcolor{#2563eb}{\mathbf{B}}) = \left(\sum_{i=1}^{n} |\textcolor{#e11d48}{a_i} - \textcolor{#2563eb}{b_i}|^{\textcolor{#d97706}{p}}\right)^{1/\textcolor{#d97706}{p}}$$
p Order parameter. p=1 is Manhattan, p=2 is Euclidean, p→∞ is Chebyshev (max absolute difference).
ai, bi Components of vectors A and B.

Worked Example

Using A = (3, 2, 0, 1) and B = (1, 0, 2, 1) with p = 3:

  1. Absolute differences raised to power p=3:
    \(|3-1|^3 + |2-0|^3 + |0-2|^3 + |1-1|^3 = 8 + 8 + 8 + 0 = 24\)
  2. Take the p-th root:
    \(d_3 = 24^{1/3} \approx 2.884\)

Compare across the spectrum: Manhattan (p=1) = 6, Euclidean (p=2) = 3.464, Minkowski p=3 = 2.884. As p increases, the distance decreases toward the max single-dimension difference (which is 2).

The hierarchy: Minkowski is the generalization. Manhattan and Euclidean are not separate formulas — they are Minkowski with p=1 and p=2 respectively. Understanding this unification is the key insight of this chapter.
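The whole family fits in one function. This sketch (my own naming) reproduces the p = 1, 2, 3 comparison above and shows the drift toward the maximum single-dimension difference as p grows:

```python
def minkowski_distance(a, b, p):
    """Lp distance: p=1 is Manhattan, p=2 is Euclidean, p→∞ is Chebyshev."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

A = (3, 2, 0, 1)
B = (1, 0, 2, 1)
for p in (1, 2, 3, 100):
    print(p, round(minkowski_distance(A, B, p), 3))
# p=1 gives 6.0 (Manhattan), p=2 gives 3.464 (Euclidean),
# p=3 gives 2.884, and p=100 is already close to 2, the max difference.
```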

Interactive: 2D Vector Plotter

Enter two 2D vectors below to visualize them as arrows from the origin. The diagram shows the cosine angle, Euclidean distance (dashed line), and Manhattan path (stepped line), with computed values for all metrics.

Interactive: Distance Metric Comparison

Enter two text snippets below. They are converted to term-frequency vectors over a shared vocabulary, and all four distance/similarity metrics are computed. Adjust the Minkowski p slider to see how the generalized distance changes.
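The text-to-vector step the widget performs can be sketched as follows. This is an illustrative version, not the widget's actual code: the tokenizer is a naive lowercase whitespace split, and the function name is hypothetical.

```python
from collections import Counter

def tf_vectors(text_a, text_b):
    """Term-frequency vectors for two texts over their shared vocabulary."""
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()
    vocab = sorted(set(tokens_a) | set(tokens_b))   # shared vocabulary
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    return vocab, [ca[w] for w in vocab], [cb[w] for w in vocab]

vocab, a, b = tf_vectors("the cat sat", "the dog sat")
print(vocab)  # → ['cat', 'dog', 'sat', 'the']
print(a, b)   # → [1, 0, 1, 1] [0, 1, 1, 1]
```

With the vectors aligned on a shared vocabulary, any of the four metrics from this chapter can be applied to the resulting lists.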

Summary: When to Use Which

| Metric | Type | Range | Magnitude-sensitive | Best for |
|---|---|---|---|---|
| Cosine similarity | Similarity | [0, 1] for text | No | Document comparison, search, clustering on raw TF/TF-IDF vectors |
| Euclidean (L2) | Distance | [0, ∞) | Yes | Low-dimensional embeddings, k-means clustering, kNN |
| Manhattan (L1) | Distance | [0, ∞) | Yes | High-dimensional sparse vectors, robust comparisons |
| Minkowski (Lp) | Distance | [0, ∞) | Yes | Tunable generalization; experiment with p as a hyperparameter |
Key Takeaway

Cosine similarity is the go-to metric for text because it ignores document length. Euclidean and Manhattan are better suited to normalized or low-dimensional vectors. Minkowski unifies the distance family with a single parameter p — understanding this hierarchy means you can always pick the right tool for the job.