Chapter 07

Word Embeddings


What you'll learn

  • How Word2Vec learns dense word vectors from raw text using two architectures: CBOW and Skip-gram
  • Why GloVe is a factorized PMI matrix and how it connects back to Chapter 1
  • How word analogies emerge from the geometry of embedding spaces
  • How to generate training pairs from a sliding context window
Prerequisites: Ch01 (PMI) for understanding the co-occurrence foundations, and Ch05 (Cosine Similarity) for measuring word vector closeness.

Introduction

In earlier chapters, we represented word relationships through co-occurrence counts and PMI scores. These methods produce sparse vectors: one dimension per word in the vocabulary, mostly filled with zeros. Word embeddings compress these relationships into dense vectors of 50–300 dimensions, where every dimension carries information.

The two landmark approaches — Word2Vec (2013) and GloVe (2014) — arrive at similar results from opposite directions. Word2Vec predicts words from their context using a shallow neural network. GloVe directly factorizes a matrix of log co-occurrence counts. Remarkably, Levy and Goldberg (2014) showed the two are closely connected: Word2Vec with negative sampling implicitly factorizes a shifted PMI matrix. The circle that began in Chapter 1 closes here.

Word2Vec: Continuous Bag of Words (CBOW)

Think of it this way: You see the words "the cat ___ on the mat" and your brain immediately fills in "sat." CBOW does the same thing: given the surrounding context words, it predicts the center word. The word vectors are trained so that averaging the context vectors points toward the target word in embedding space.

CBOW Objective
$$\mathcal{L}_{\text{CBOW}} = -\log P(\textcolor{#e11d48}{w_t} \mid \textcolor{#2563eb}{w_{t-c}}, \ldots, \textcolor{#2563eb}{w_{t+c}})$$
Prediction via softmax
$$P(\textcolor{#e11d48}{w_t} \mid \text{context}) = \frac{\exp(\textcolor{#7c3aed}{\mathbf{v}_{w_t}} \cdot \textcolor{#059669}{\mathbf{h}})}{\displaystyle\sum_{w \in V} \exp(\mathbf{v}_w \cdot \textcolor{#059669}{\mathbf{h}})}$$
Hidden layer (average of context vectors)
$$\textcolor{#059669}{\mathbf{h}} = \frac{1}{2c} \sum_{-c \le j \le c,\, j \neq 0} \textcolor{#2563eb}{\mathbf{u}_{w_{t+j}}}$$
\(w_t\) The target (center) word to predict
\(w_{t+j}\) Context words within a window of size c on each side
\(\mathbf{h}\) Hidden layer: the average of all context word input vectors
\(\mathbf{v}_w\) Output vector for word w (separate from the input vectors \(\mathbf{u}_w\))

Worked Example (Conceptual)

Sentence: "the cat sat on the mat" — window size c = 2, target = "sat"

  1. Collect the context words: "the", "cat", "on", "the"
    (2 words before and 2 words after "sat"; duplicates are kept)
  2. Look up their input vectors and average them:
    \(\mathbf{h} = \frac{1}{4}(\mathbf{u}_{\text{the}} + \mathbf{u}_{\text{cat}} + \mathbf{u}_{\text{on}} + \mathbf{u}_{\text{the}})\)
  3. Compute the dot product of h with every output vector in the vocabulary and apply the softmax:
    \(P(\text{sat} \mid \text{context}) = \frac{\exp(\mathbf{v}_{\text{sat}} \cdot \mathbf{h})}{\sum_{w \in V} \exp(\mathbf{v}_w \cdot \mathbf{h})}\). Training maximizes this probability via gradient descent.

Over many such windows across a large corpus, the vectors gradually encode semantic meaning.
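The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration with a toy five-word vocabulary and randomly initialized embeddings (names like `U`, `V`, and `cbow_probs` are ours, not from any library), showing only the forward pass, not the gradient update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and randomly initialized embeddings (dim = 8).
vocab = ["the", "cat", "sat", "on", "mat"]
word2id = {w: i for i, w in enumerate(vocab)}
dim = 8
U = rng.normal(scale=0.1, size=(len(vocab), dim))  # input (context) vectors u_w
V = rng.normal(scale=0.1, size=(len(vocab), dim))  # output vectors v_w

def cbow_probs(context):
    """P(w | context) for every w: average the context vectors, then softmax."""
    h = U[[word2id[w] for w in context]].mean(axis=0)  # hidden layer h
    scores = V @ h                                     # dot with every output vector
    exp = np.exp(scores - scores.max())                # numerically stable softmax
    return exp / exp.sum()

p = cbow_probs(["the", "cat", "on", "the"])
# The CBOW loss for target "sat" is -log p[word2id["sat"]].
```

With random vectors the distribution is near-uniform; training adjusts `U` and `V` so the probability mass concentrates on the true center word.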

CBOW vs. Skip-gram: CBOW is faster to train (one prediction per window) and works well for frequent words. Skip-gram makes separate predictions for each context word, giving it more training signal for rare words.

Word2Vec: Skip-gram with Negative Sampling

Think of it this way: Given the word "sat," can you guess which words appeared nearby? Skip-gram flips CBOW around: from one center word, predict each surrounding context word. This generates more training examples per window, which helps with rare words.

Skip-gram Objective (full softmax)
$$\mathcal{L}_{\text{SG}} = -\frac{1}{T}\sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log P(\textcolor{#2563eb}{w_{t+j}} \mid \textcolor{#e11d48}{w_t})$$
Skip-gram with Negative Sampling (SGNS), loss for one (center, context) pair
$$\mathcal{L}_{\text{SGNS}} = -\log \sigma(\textcolor{#2563eb}{\mathbf{v}_c} \cdot \textcolor{#e11d48}{\mathbf{v}_w}) - \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n} \Big[\log \sigma(-\textcolor{#2563eb}{\mathbf{v}_c} \cdot \textcolor{#d97706}{\mathbf{v}_{w_i}})\Big]$$
\(\mathbf{v}_w\) Vector for the target (center) word
\(\mathbf{v}_c\) Vector for a true context word (positive example)
\(\mathbf{v}_{w_i}\) Vectors for randomly sampled negative examples (words NOT in the context)
\(\sigma\) Sigmoid function: \(\sigma(x) = 1/(1 + e^{-x})\). Turns a dot product into a probability.
k Number of negative samples (typically 5–15)

Worked Example: Training Pair Generation

Sentence: "the cat sat on the mat" — window size c = 2

  1. For center word "sat" (index 2, counting from 0), generate a pair with each context word:
    (sat, the), (sat, cat), (sat, on), (sat, the)
  2. For each positive pair, sample k negative words randomly from the vocabulary.
    E.g., for (sat, cat): negatives might be {fish, market, cream, love, stock}
  3. Maximize: \(\log \sigma(\mathbf{v}_{\text{cat}} \cdot \mathbf{v}_{\text{sat}}) + \sum \log \sigma(-\mathbf{v}_{\text{neg}} \cdot \mathbf{v}_{\text{sat}})\). This pulls real context words closer and pushes the random negatives apart.
Why negative sampling? The full softmax requires computing a dot product with every word in the vocabulary (often 100K+ words) for each training step. Negative sampling approximates this by contrasting a few random "negative" words instead, making training orders of magnitude faster.
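The per-pair SGNS computation can be sketched as follows. The embeddings here are random stand-ins purely for illustration (in training they are model parameters updated by gradient ascent on this quantity), and `sgns_objective` is our own name for the bracketed expression above:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8
v_center = rng.normal(scale=0.1, size=dim)          # v_w, e.g. "sat"
v_context = rng.normal(scale=0.1, size=dim)         # v_c, e.g. "cat" (positive)
v_negatives = rng.normal(scale=0.1, size=(5, dim))  # k = 5 sampled negatives

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(v_w, v_c, negs):
    """Per-pair SGNS objective (maximized): score the true context word
    as 'real' and each sampled negative as 'fake'."""
    pos = np.log(sigmoid(v_c @ v_w))            # pull the true pair together
    neg = np.log(sigmoid(-(negs @ v_w))).sum()  # push the negatives away
    return pos + neg

obj = sgns_objective(v_center, v_context, v_negatives)
```

Note the cost is k + 1 dot products per pair, versus |V| dot products for the full softmax; that is the entire speedup.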

GloVe: Global Vectors

Think of it this way: If "ice" and "cream" co-occur 80 times in your corpus, then the dot product of their word vectors should approximate log(80). GloVe directly optimizes for this relationship: make vector dot products match log co-occurrence counts. It combines the efficiency of count-based methods with the performance of prediction-based ones.

GloVe Objective
$$J = \sum_{i,j=1}^{|V|} \textcolor{#059669}{f(X_{ij})} \Big(\textcolor{#e11d48}{\mathbf{w}_i} \cdot \textcolor{#2563eb}{\widetilde{\mathbf{w}}_j} + \textcolor{#d97706}{b_i} + \textcolor{#d97706}{\widetilde{b}_j} - \log \textcolor{#7c3aed}{X_{ij}}\Big)^2$$
Weighting function
$$\textcolor{#059669}{f(x)} = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$
\(\mathbf{w}_i\) Word vector for word i (target embedding)
\(\widetilde{\mathbf{w}}_j\) Context vector for word j (separate set of embeddings)
\(X_{ij}\) Co-occurrence count: how often words i and j appear together within a window
\(b_i, \widetilde{b}_j\) Bias terms that absorb word-frequency effects
\(f(X_{ij})\) Weighting function: downweights very frequent pairs (typically \(x_{\max} = 100\), \(\alpha = 0.75\))

Worked Example (Conceptual)

Suppose words "ice" (i) and "cream" (j) co-occur Xij = 80 times in the corpus.

  1. The target for the dot product is:
    \(\mathbf{w}_{\text{ice}} \cdot \widetilde{\mathbf{w}}_{\text{cream}} + b_{\text{ice}} + \widetilde{b}_{\text{cream}} \approx \log(80) \approx 4.38\) (natural log)
  2. Compute the weight: \(f(80) = (80/100)^{0.75} = 0.8^{0.75} \approx 0.846\)
    (slightly downweighted since \(80 < x_{\max} = 100\))
  3. Contribution to the loss:
    \(0.846 \times (\mathbf{w}_{\text{ice}} \cdot \widetilde{\mathbf{w}}_{\text{cream}} + b_{\text{ice}} + \widetilde{b}_{\text{cream}} - 4.38)^2\). Minimize this squared error over all word pairs.
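The weight and per-pair loss from the worked example can be checked directly. This is a minimal sketch; `glove_weight` and `glove_pair_loss` are our own illustrative names, and the dot product and biases are made-up values chosen so their sum lands near log(80):

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(X_ij): downweight rare pairs, cap at 1 for very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(dot, b_i, b_j, x_ij):
    """Weighted squared error for one (i, j) cell of the co-occurrence matrix."""
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

w = glove_weight(80)                        # ~0.846, matching step 2
loss = glove_pair_loss(4.0, 0.2, 0.18, 80)  # tiny: 4.38 nearly matches log(80)
```

The full GloVe objective is just this quantity summed over every nonzero cell of the co-occurrence matrix.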
GloVe = Factorized PMI: When the bias terms absorb the marginal log-probabilities, the GloVe objective reduces to \(\mathbf{w}_i \cdot \widetilde{\mathbf{w}}_j \approx \text{PMI}(i, j)\): GloVe is, in effect, factorizing a weighted PMI matrix into low-rank word vectors. This is why PMI from Chapter 1 is the conceptual foundation for modern embeddings.

Interactive: Skip-gram Training Pairs

Enter text and adjust the window size to see how Skip-gram generates (center, context) training pairs from a sliding window. Click a row to highlight that window position in the text.
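The pair generation behind this demo can be sketched in a few lines. This is a minimal implementation under the usual conventions (symmetric window, window truncated at sentence boundaries); `skipgram_pairs` is our own name:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs from a sliding window
    of `window` words on each side of every position."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)              # clip window at sentence start
        hi = min(len(tokens), t + window + 1)  # ...and at sentence end
        for j in range(lo, hi):
            if j != t:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split(), window=2)
# Pairs for center "sat": ("sat","the"), ("sat","cat"), ("sat","on"), ("sat","the")
```

Positions near the edges yield fewer pairs, which is why center words in the middle of a sentence contribute more training signal.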

Interactive: Word Analogy Tester

The famous word analogy test: "king - man + woman = queen." This demo uses a small set of pre-computed 2D word vectors to demonstrate how vector arithmetic captures semantic relationships. Enter three words (A, B, C) and the system finds D such that A:B :: C:D by computing B − A + C and finding the nearest vector.
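The analogy arithmetic is straightforward to sketch. The 2D vectors below are made-up toy values chosen so the geometry works out, not trained embeddings; real systems do the same B − A + C nearest-neighbor search (by cosine similarity, as in Ch05) over vocabularies of tens of thousands of words:

```python
import numpy as np

# Toy, hand-picked 2D vectors purely for illustration (not trained).
vecs = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.5, 0.8]),
    "woman": np.array([0.5, 0.2]),
    "cat":   np.array([0.1, 0.9]),
}

def analogy(a, b, c):
    """Solve A : B :: C : ? by the nearest cosine neighbor to B - A + C,
    excluding the three query words themselves."""
    target = vecs[b] - vecs[a] + vecs[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in vecs if w not in {a, b, c}),
               key=lambda w: cos(vecs[w], target))

result = analogy("man", "king", "woman")  # → "queen" with these toy vectors
```

Excluding the query words matters: B − A + C is usually closest to C (or B) itself, so the trivial answers must be filtered out.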

Summary: Comparing Embedding Approaches

Method    | Type                     | Core Idea                        | Strengths
CBOW      | Prediction-based         | Predict center word from context | Fast training; good for frequent words
Skip-gram | Prediction-based         | Predict context from center word | Better for rare words; more training pairs per window
GloVe     | Count-based (factorized) | Dot product ≈ log co-occurrence  | Leverages global statistics; interpretable objective
Key Takeaway

Word2Vec and GloVe are two sides of the same coin. Skip-gram with negative sampling implicitly factorizes a shifted PMI matrix; GloVe explicitly factorizes a log co-occurrence matrix (which is closely related to PMI). Both produce dense vectors where semantic similarity corresponds to geometric closeness — and both have their roots in the PMI formulas from Chapter 1.