Word Embeddings
What you'll learn
- How Word2Vec learns dense word vectors from raw text using two architectures: CBOW and Skip-gram
- Why GloVe is a factorized PMI matrix and how it connects back to Chapter 1
- How word analogies emerge from the geometry of embedding spaces
- How to generate training pairs from a sliding context window
Introduction
In earlier chapters, we represented word relationships through co-occurrence counts and PMI scores. These methods produce sparse vectors: one dimension per word in the vocabulary, mostly filled with zeros. Word embeddings compress these relationships into dense vectors of 50–300 dimensions, where every dimension carries information.
The two landmark approaches, Word2Vec (2013) and GloVe (2014), arrive at similar results from opposite directions. Word2Vec predicts words from their context using a shallow neural network. GloVe directly factorizes a matrix of log co-occurrence counts. Remarkably, Levy and Goldberg (2014) showed that the two are deeply connected: Word2Vec with negative sampling implicitly factorizes a shifted PMI matrix. The circle that began in Chapter 1 closes here.
Word2Vec: Continuous Bag of Words (CBOW)
Think of it this way: You see the words "the cat ___ on the mat" and your brain immediately fills in "sat." CBOW does the same thing: given the surrounding context words, it predicts the center word. The word vectors are trained so that averaging the context vectors points toward the target word in embedding space.
Worked Example (Conceptual)
Sentence: "the cat sat on the mat" — window size c = 2, target = "sat"
- Collect context words: {"the", "cat", "on", "the"} (2 words before and 2 words after "sat").
- Look up their input vectors and average them:
  \(\mathbf{h} = \frac{1}{4}(\mathbf{u}_{\text{the}} + \mathbf{u}_{\text{cat}} + \mathbf{u}_{\text{on}} + \mathbf{u}_{\text{the}})\)
- Compute the dot product of \(\mathbf{h}\) with every output vector in the vocabulary and apply a softmax:
  \(P(\text{sat} \mid \text{context}) = \text{softmax}(\mathbf{v}_{\text{sat}} \cdot \mathbf{h})\)
- Maximize this probability via gradient descent.
Over many such windows across a large corpus, the vectors gradually encode semantic meaning.
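The averaging-and-softmax step above can be sketched in a few lines of NumPy. The toy vocabulary, embedding dimension, and random initialization here are illustrative assumptions, not the original Word2Vec implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding dimension (illustrative choices).
vocab = ["the", "cat", "sat", "on", "mat"]
word2id = {w: i for i, w in enumerate(vocab)}
dim = 8

# Word2Vec keeps two embedding tables: input (context) and output (target).
U = rng.normal(scale=0.1, size=(len(vocab), dim))  # input vectors u_w
V = rng.normal(scale=0.1, size=(len(vocab), dim))  # output vectors v_w

def cbow_probs(context_words):
    """Average the context input vectors, score against every output
    vector, and softmax over the vocabulary."""
    h = np.mean([U[word2id[w]] for w in context_words], axis=0)
    scores = V @ h                       # one score per vocabulary word
    e = np.exp(scores - scores.max())    # numerically stable softmax
    return e / e.sum()

p = cbow_probs(["the", "cat", "on", "the"])
# p[word2id["sat"]] is the probability CBOW assigns to the center word;
# training adjusts U and V to increase it.
```

Training would backpropagate through this forward pass; with random vectors, the distribution starts out near uniform.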
Word2Vec: Skip-gram with Negative Sampling
Think of it this way: Given the word "sat," can you guess which words appeared nearby? Skip-gram flips CBOW around: from one center word, predict each surrounding context word. This generates more training examples per window, which helps with rare words.
Worked Example: Training Pair Generation
Sentence: "the cat sat on the mat" — window size c = 2
- For the center word "sat" (position 2), generate a pair with each context word:
  (sat, the), (sat, cat), (sat, on), (sat, the)
- For each positive pair, sample k negative words randomly from the vocabulary.
  E.g., for (sat, cat): negatives might be {fish, market, cream, love, stock}
- Maximize: \(\log \sigma(\mathbf{v}_{\text{cat}} \cdot \mathbf{v}_{\text{sat}}) + \sum_{i=1}^{k} \log \sigma(-\mathbf{v}_{\text{neg}_i} \cdot \mathbf{v}_{\text{sat}})\)
- Push real context words closer, push random words apart.
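The pair generation and the negative-sampling objective can both be sketched directly; this is a minimal illustration, not the optimized word2vec training loop (which also subsamples frequent words and draws negatives from a smoothed unigram distribution):

```python
import numpy as np

def skipgram_pairs(tokens, window):
    """Generate (center, context) pairs from a sliding window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=2)
# Pairs with center "sat": (sat, the), (sat, cat), (sat, on), (sat, the)

def sgns_loss(v_center, v_context, v_negatives):
    """Negative-sampling objective for one positive pair (to be maximized):
    log sigma(v_ctx . v_c) + sum over negatives of log sigma(-v_neg . v_c)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = np.log(sigmoid(v_context @ v_center))
    neg = sum(np.log(sigmoid(-v_n @ v_center)) for v_n in v_negatives)
    return pos + neg
```

Because each term is a log of a sigmoid, the objective is always negative; training pushes it toward zero by aligning positive pairs and separating negative ones.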
GloVe: Global Vectors
Think of it this way: If "ice" and "cream" co-occur 80 times in your corpus, then the dot product of their word vectors should approximate log(80). GloVe directly optimizes for this relationship: make vector dot products match log co-occurrence counts. It combines the efficiency of count-based methods with the performance of prediction-based ones.
Worked Example (Conceptual)
Suppose the words "ice" (i) and "cream" (j) co-occur \(X_{ij} = 80\) times in the corpus.

- The target for the dot product is:
  \(\mathbf{w}_{\text{ice}} \cdot \widetilde{\mathbf{w}}_{\text{cream}} + b_{\text{ice}} + \widetilde{b}_{\text{cream}} \approx \log(80) \approx 4.38\)
- Compute the weight: \(f(80) = (80/100)^{0.75} = 0.8^{0.75} \approx 0.846\)
  (slightly downweighted since \(80 < x_{\max} = 100\))
- Contribution to the loss:
  \(0.846 \times (\mathbf{w}_{\text{ice}} \cdot \widetilde{\mathbf{w}}_{\text{cream}} + b_{\text{ice}} + \widetilde{b}_{\text{cream}} - 4.38)^2\)
- Minimize this squared error over all word pairs.
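The weighting function and per-pair loss term from the worked example translate directly to code; this is a sketch of the objective only, with the standard GloVe hyperparameters \(x_{\max} = 100\) and \(\alpha = 0.75\) assumed as defaults:

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): downweights rare pairs, caps at 1 for x >= x_max."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_term(dot, b_i, b_j, x_ij):
    """One word pair's contribution to the GloVe loss:
    f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

# Numbers from the worked example: X_ij = 80, log(80) ≈ 4.38.
w = glove_weight(80)  # ≈ 0.846
```

If the model's prediction exactly matched \(\log X_{ij}\), the term would be zero; training minimizes the weighted sum of these terms over all co-occurring pairs.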
Interactive: Skip-gram Training Pairs
Enter text and adjust the window size to see how Skip-gram generates (center, context) training pairs from a sliding window. Click a row to highlight that window position in the text.
Interactive: Word Analogy Tester
The famous word analogy test: "king - man + woman = queen." This demo uses a small set of pre-computed 2D word vectors to demonstrate how vector arithmetic captures semantic relationships. Enter three words (A, B, C) and the system finds D such that A:B :: C:D by computing B − A + C and finding the nearest vector.
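The nearest-neighbor search behind the demo can be sketched with a few hand-picked 2D vectors. The vectors below are hypothetical toy values chosen so the arithmetic works out; real embeddings have 50–300 learned dimensions:

```python
import numpy as np

# Hypothetical 2D vectors for illustration only.
vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.5, 0.8]),
    "woman": np.array([0.5, 0.2]),
}

def analogy(a, b, c):
    """Solve A:B :: C:D by finding the word nearest to B - A + C
    (cosine similarity), excluding the three query words."""
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target) + 1e-9)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

analogy("man", "king", "woman")  # → "queen" with these toy vectors
```

Excluding the query words is standard practice: the nearest vector to B − A + C is often B itself.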
Summary: Comparing Embedding Approaches
| Method | Type | Core Idea | Strengths |
|---|---|---|---|
| CBOW | Prediction-based | Predict center word from context | Fast training, good for frequent words |
| Skip-gram | Prediction-based | Predict context from center word | Better for rare words, more training pairs per window |
| GloVe | Count-based (factorized) | Dot product ≈ log co-occurrence | Leverages global statistics, interpretable objective |
Word2Vec and GloVe are two sides of the same coin. Skip-gram with negative sampling implicitly factorizes a shifted PMI matrix; GloVe explicitly factorizes a log co-occurrence matrix (which is closely related to PMI). Both produce dense vectors where semantic similarity corresponds to geometric closeness — and both have their roots in the PMI formulas from Chapter 1.
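The shifted PMI quantity from the Levy and Goldberg result is easy to state concretely: for Skip-gram with k negative samples, the optimal value of \(\mathbf{v}_w \cdot \mathbf{v}_c\) is \(\text{PMI}(w, c) - \log k\). A minimal sketch, computing it from raw counts (the counts here are made up for illustration):

```python
import math

def shifted_pmi(count_wc, count_w, count_c, total, k):
    """Shifted PMI: the value SGNS implicitly assigns to w . c
    (Levy & Goldberg 2014): PMI(w, c) - log k."""
    pmi = math.log((count_wc * total) / (count_w * count_c))
    return pmi - math.log(k)

# Illustrative counts: pair seen 80 times, words seen 1000 and 500 times,
# in a corpus of 100,000 co-occurrence events, with k = 5 negatives.
value = shifted_pmi(80, 1000, 500, 100_000, 5)
```

The `- log k` shift is why the factorization is "shifted" rather than plain PMI: more negative samples push all dot products down by a constant.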