Similarity & Distance
What you'll learn
- How cosine similarity measures the angle between vectors, ignoring magnitude
- When Euclidean distance is appropriate and when it misleads
- Why Manhattan distance is more robust in high dimensions
- How Minkowski distance generalizes both Euclidean and Manhattan
Introduction
Once you represent text as vectors (term frequencies, TF-IDF weights, or embeddings), the next question is: how similar are two vectors? This chapter covers the four most important ways to answer that question.
The core insight is that "similarity" and "distance" are two sides of the same coin. High similarity means low distance, and vice versa. But the choice of metric matters enormously: cosine similarity ignores how long the vectors are (great for comparing documents of different lengths), while Euclidean distance cares about magnitude (better for comparing points in a fixed space).
We will build from the most specific metrics up to the general case: Minkowski distance unifies Euclidean and Manhattan into a single family controlled by one parameter, p.
Cosine Similarity
Think of it this way: Imagine two arrows pointing out from the origin. Cosine similarity measures the angle between them, ignoring how long each arrow is. Two documents about the same topic will point in roughly the same direction even if one is a tweet and the other is a novel — they differ in magnitude (word counts), not direction (word proportions).
Worked Example
Given two TF vectors: A = (3, 2, 0, 1) and B = (1, 0, 2, 1) over vocabulary {cat, dog, fish, the}:
- Dot product:
  \(\mathbf{A} \cdot \mathbf{B} = (3)(1) + (2)(0) + (0)(2) + (1)(1) = 3 + 0 + 0 + 1 = 4\)
- Magnitudes:
  \(\|\mathbf{A}\| = \sqrt{9 + 4 + 0 + 1} = \sqrt{14} \approx 3.742\)
  \(\|\mathbf{B}\| = \sqrt{1 + 0 + 4 + 1} = \sqrt{6} \approx 2.449\)
- Cosine similarity:
  \(\cos(\theta) = \frac{4}{3.742 \times 2.449} = \frac{4}{9.165} \approx 0.4364\)
A cosine of 0.44 indicates moderate similarity — the documents share some vocabulary but differ significantly.
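The arithmetic above is easy to reproduce in a few lines of Python. This is a minimal sketch for clarity, not a production implementation (a real one would guard against zero-magnitude vectors):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

A = (3, 2, 0, 1)
B = (1, 0, 2, 1)
print(round(cosine_similarity(A, B), 4))  # 0.4364
```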
Euclidean Distance
Think of it this way: This is the "as the crow flies" distance — the length of a straight line between two points. In 2D, it is the Pythagorean theorem. In higher dimensions, it is the same idea extended. Unlike cosine, magnitude matters: a long document far from the origin will be "far" from a short document near the origin, even if they point in the same direction.
Worked Example
Using the same vectors: A = (3, 2, 0, 1) and B = (1, 0, 2, 1):
- Element-wise differences squared:
  \((3-1)^2 + (2-0)^2 + (0-2)^2 + (1-1)^2 = 4 + 4 + 4 + 0 = 12\)
- Take the square root:
  \(d = \sqrt{12} \approx 3.464\)
The straight-line distance between these two vectors is 3.464 units.
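The same two-step computation, as a small Python sketch:

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

A = (3, 2, 0, 1)
B = (1, 0, 2, 1)
print(round(euclidean_distance(A, B), 3))  # 3.464
```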
Manhattan Distance
Think of it this way: Imagine navigating a city grid — you can only walk along streets (horizontal or vertical), never diagonally. Manhattan distance is the total number of blocks you walk. It sums absolute differences instead of squaring them, making it less sensitive to outlier dimensions.
Worked Example
Using A = (3, 2, 0, 1) and B = (1, 0, 2, 1):
- Absolute differences:
  \(|3-1| + |2-0| + |0-2| + |1-1| = 2 + 2 + 2 + 0 = 6\)
The city-block distance is 6 — notably larger than the Euclidean distance of 3.464, because Manhattan cannot take diagonal shortcuts.
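In code, Manhattan distance is a one-liner: sum the absolute differences.

```python
def manhattan_distance(a, b):
    """City-block (L1) distance: sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

A = (3, 2, 0, 1)
B = (1, 0, 2, 1)
print(manhattan_distance(A, B))  # 6
```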
Minkowski Distance
Think of it this way: Minkowski distance is the general family that contains both Manhattan and Euclidean as special cases. A single parameter p controls the shape: p=1 gives Manhattan, p=2 gives Euclidean, and as p approaches infinity, only the single largest difference matters (Chebyshev distance). It is the "dial" that lets you tune how much large differences are penalized.
Worked Example
Using A = (3, 2, 0, 1) and B = (1, 0, 2, 1) with p = 3:
- Absolute differences raised to the power p = 3:
  \(|3-1|^3 + |2-0|^3 + |0-2|^3 + |1-1|^3 = 8 + 8 + 8 + 0 = 24\)
- Take the p-th root:
  \(d_3 = 24^{1/3} \approx 2.884\)
Compare across the spectrum: Manhattan (p=1) = 6, Euclidean (p=2) = 3.464, Minkowski p=3 = 2.884. As p increases, the distance decreases toward the max single-dimension difference (which is 2).
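A short sketch makes the "dial" concrete. One function covers the whole family, and sweeping p reproduces the spectrum above (p=1 Manhattan, p=2 Euclidean, larger p approaching the biggest single difference):

```python
def minkowski_distance(a, b, p):
    """Generalized Lp distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

A = (3, 2, 0, 1)
B = (1, 0, 2, 1)
for p in (1, 2, 3):
    print(p, round(minkowski_distance(A, B, p), 3))  # 6.0, 3.464, 2.884

# With a very large p, the result creeps toward the largest
# single-dimension difference, which is 2 for these vectors.
print(round(minkowski_distance(A, B, 50), 3))
```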
Interactive: 2D Vector Plotter
Enter two 2D vectors below to visualize them as arrows from the origin. The diagram shows the cosine angle, Euclidean distance (dashed line), and Manhattan path (stepped line), with computed values for all metrics.
Interactive: Distance Metric Comparison
Enter two text snippets below. They are converted to term-frequency vectors over a shared vocabulary, and all four distance/similarity metrics are computed. Adjust the Minkowski p slider to see how the generalized distance changes.
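The pipeline behind this comparison can be approximated offline: build term-frequency vectors over the shared vocabulary of two texts, then apply all four metrics. This is a sketch under simple assumptions (a lowercase whitespace tokenizer; the interactive tool's tokenization may differ):

```python
import math
from collections import Counter

def tf_vectors(text_a, text_b):
    """Term-frequency vectors over the shared vocabulary of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[w] for w in vocab], [counts_b[w] for w in vocab]

def compare(text_a, text_b, p=3):
    """Compute all four similarity/distance metrics on TF vectors."""
    a, b = tf_vectors(text_a, text_b)
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return {
        "cosine": dot / norms,
        "euclidean": sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5,
        "manhattan": sum(abs(x - y) for x, y in zip(a, b)),
        "minkowski": sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p),
    }

print(compare("the cat saw the dog", "the dog saw the fish"))
```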
Summary: When to Use Which
| Metric | Type | Range | Magnitude-Sensitive | Best For |
|---|---|---|---|---|
| Cosine Similarity | Similarity | [0, 1] for text | No | Document comparison, search, clustering on raw TF/TF-IDF vectors |
| Euclidean (L2) | Distance | [0, ∞) | Yes | Low-dimensional embeddings, k-means clustering, kNN |
| Manhattan (L1) | Distance | [0, ∞) | Yes | High-dimensional sparse vectors, robust comparisons |
| Minkowski (Lp) | Distance | [0, ∞) | Yes | Tunable generalization — experiment with p as a hyperparameter |
Cosine similarity is the go-to metric for text because it ignores document length. Euclidean and Manhattan are better suited to normalized or low-dimensional vectors. Minkowski unifies the distance family with a single parameter p — understanding this hierarchy means you can always pick the right tool for the job.