The Road to Transformers
What you'll learn
- How every formula in this manual maps to a specific component inside modern transformers
- The softmax function and why it converts raw scores into probabilities
- How attention works: dot-product similarity scaled by dimension, applied to queries, keys, and values
- Why self-attention creates contextual embeddings that surpass static Word2Vec vectors
- How positional encoding injects word order into a position-agnostic architecture
- Why layer normalization is the z-score from Ch02 with learnable parameters
- How the full transformer block composes these pieces into the engine behind modern AI
Introduction
You have spent eleven chapters building a vocabulary of formulas for measuring word association, similarity, information, and language structure. This appendix reveals a surprising fact: every one of those formulas is alive inside modern transformer models. The architecture behind GPT, BERT, and their descendants is not a break from classical NLP — it is a synthesis of exactly the ideas you have already learned.
This appendix does not attempt a full implementation of the transformer. Instead, it walks through the key mathematical components, showing for each one precisely which earlier chapter it connects to. By the end, you will see that the distance from counting word pairs (Chapter 1) to ChatGPT is shorter than you might think.
We will build up piece by piece: from the softmax function that turns scores into probabilities, through the attention mechanism that is the transformer's core innovation, to the full transformer block that stacks these pieces into the deep networks powering today's language models.
1. The Softmax Function
Think of it this way: You have a list of raw scores (called logits) — maybe [2.0, 1.0, 0.1]. Softmax converts them into a probability distribution that sums to 1, while preserving the relative ranking. Higher scores get higher probabilities, and because of the exponential, gaps between scores are amplified rather than preserved linearly.
Worked Example
Given logits [2.0, 1.0, 0.1]:
- Exponentiate each: \(e^{2.0} = 7.389\), \(e^{1.0} = 2.718\), \(e^{0.1} = 1.105\)
- Sum: \(7.389 + 2.718 + 1.105 = 11.212\)
- Divide each by the sum: [0.659, 0.242, 0.099]
The highest logit (2.0) gets 65.9% of the probability mass. The probabilities sum to 1.0.
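The worked example above can be reproduced in a few lines of NumPy. This is a minimal sketch, not a production implementation; subtracting the maximum logit before exponentiating is a standard trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores (logits) into a probability distribution."""
    exps = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.round(3))  # [0.659 0.242 0.099]
```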
2. Dot-Product Attention
Think of it this way: Imagine a database lookup. You have a query (what you're looking for), a set of keys (labels on the stored items), and values (the stored items themselves). Attention computes how well each key matches the query, then returns a weighted combination of the values — more weight on better matches.
Worked Example
Given 3 tokens with 2-dimensional Q, K, V:
- \(Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}\), \(K = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}\), \(V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}\)
- Compute \(QK^T\): each entry is a dot product between a query row and a key row. \(QK^T = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 2 & 1 & 1 \end{bmatrix}\)
- Apply softmax to each row to get attention weights, then multiply by \(V\) to get the output.
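The three steps above can be sketched directly in NumPy, using the same Q, K, and V matrices as the worked example (no scaling yet; that arrives in the next concept):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q = np.array([[1., 0.], [0., 1.], [1., 1.]])
K = np.array([[1., 1.], [0., 1.], [1., 0.]])
V = np.array([[1., 0.], [0., 1.], [1., 1.]])

scores = Q @ K.T            # pairwise dot products: query i against key j
weights = softmax(scores)   # each row becomes a probability distribution
output = weights @ V        # weighted combination of the value rows
print(scores)
```

The `scores` matrix matches the \(QK^T\) computed by hand above, and each row of `weights` sums to 1.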
3. Scaled Dot-Product Attention
Think of it this way: When vectors have many dimensions, their dot products tend to be large numbers. Large inputs to softmax push it into a regime where almost all the weight goes to a single token (saturation). Dividing by \(\sqrt{d_k}\) rescales the values to a sensible range, keeping the softmax gradient healthy during training.
Why \(\sqrt{d_k}\) specifically?
If the entries of Q and K are independent random variables with mean 0 and variance 1, then each dot product \(q \cdot k = \sum_{i=1}^{d_k} q_i k_i\) has mean 0 and variance \(d_k\). Dividing by \(\sqrt{d_k}\) brings the variance back to 1, regardless of dimension.
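The variance argument can be checked empirically. This sketch samples random standard-normal query and key vectors (matching the mean-0, variance-1 assumption above); the sample sizes and dimensions are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 256):
    q = rng.standard_normal((20_000, d_k))  # 20,000 random query vectors
    k = rng.standard_normal((20_000, d_k))  # 20,000 random key vectors
    dots = (q * k).sum(axis=1)              # raw dot products: variance grows with d_k
    scaled = dots / np.sqrt(d_k)            # scaled dot products: variance stays near 1
    print(f"d_k={d_k}: var(q.k)={dots.var():.1f}, var(scaled)={scaled.var():.2f}")
```

The raw variance tracks \(d_k\) (roughly 4, 64, 256), while the scaled variance stays close to 1 at every dimension.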
4. Self-Attention
Think of it this way: In "The bank by the river was eroding," how does the model know "bank" means a riverbank and not a financial institution? Self-attention lets every word look at every other word in the same sentence. "Bank" attends to "river" and "eroding," gathering context to disambiguate its meaning. This creates a different representation for "bank" in every sentence it appears in.
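In self-attention, the queries, keys, and values all come from the same token sequence: each is the input matrix X multiplied by its own projection matrix. The sketch below uses random matrices as stand-ins for the learned projections \(W_q\), \(W_k\), \(W_v\) (in a trained model these are learned from data), just to show the shapes and data flow:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 8
X = rng.standard_normal((5, d_model))  # 5 token embeddings, e.g. "the bank by the river"

# Random stand-ins for the learned projection matrices W_q, W_k, W_v
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v               # all derived from the same X
scores = Q @ K.T / np.sqrt(d_model)               # scaled dot-product scores
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)     # softmax over each row
contextual = weights @ V                           # each row now mixes in its context
print(contextual.shape)  # (5, 8)
```

Each output row is a context-dependent blend of all five tokens, which is exactly why the same word gets a different representation in every sentence.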
5. Positional Encoding
Think of it this way: Self-attention treats the input as a set, not a sequence — "dog bites man" and "man bites dog" would produce the same attention weights. Positional encoding adds a unique signal to each position so the model can distinguish word order. The sinusoidal pattern creates a kind of "binary clock" where different frequencies tick at different rates, giving each position a unique fingerprint.
How the Frequencies Work
For a 512-dimensional encoding, dimension 0 oscillates rapidly (wavelength 2π), while the highest dimensions oscillate very slowly (wavelength approaching 2π × 10000). This means nearby positions differ mostly in the high-frequency dimensions, while distant positions differ across all dimensions. The model can learn to read these patterns to recover both absolute position and relative distance.
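The sinusoidal scheme is short enough to write out in full. This is a minimal sketch of the standard formulation, where even dimensions use sine and odd dimensions use cosine, with one frequency per dimension pair:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(n_positions)[:, None]     # column of positions (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]     # even dimension indices
    angles = pos / (10000 ** (i / d_model))   # one frequency per dimension pair
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)              # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)  # (50, 512)
```

At position 0 every sine dimension is 0 and every cosine dimension is 1; as the position grows, the low-index dimensions "tick" fastest, exactly like the binary-clock picture above.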
6. Multi-Head Attention
Think of it this way: A single attention head can only focus on one type of relationship at a time. Multi-head attention runs several attention operations in parallel, each with its own learned projections. One head might learn syntactic dependencies (subject-verb), another might learn semantic similarity, and another might track positional patterns. The results are concatenated and projected to form a richer representation.
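Structurally, multi-head attention is just the single-head computation repeated with separate projections, then concatenated and projected back. The sketch below uses random weights as stand-ins for the learned per-head projections and the output projection \(W_o\); the dimensions (4 tokens, model size 16, 4 heads) are arbitrary illustration choices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
n_tokens, d_model, n_heads = 4, 16, 4
d_head = d_model // n_heads                # each head works in a smaller subspace
X = rng.standard_normal((n_tokens, d_model))

heads = []
for _ in range(n_heads):
    # Random stand-ins for this head's learned W_q, W_k, W_v
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)

W_o = rng.standard_normal((d_model, d_model))
out = np.concatenate(heads, axis=-1) @ W_o  # concatenate heads, project back
print(out.shape)  # (4, 16)
```

Note that splitting d_model across heads keeps the total computation comparable to a single full-width head, while letting each head specialize.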
7. Layer Normalization
Think of it this way: As data flows through many transformer layers, the activations can drift to very large or very small values, making training unstable. Layer normalization recenters and rescales each layer's output to have zero mean and unit variance — exactly the z-score you learned in Chapter 2, plus learnable parameters that let the model undo the normalization if needed.
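The connection to the z-score is easiest to see in code. This minimal sketch normalizes across the feature dimension; `gamma` and `beta` stand in for the learnable scale and shift parameters (here left at their identity defaults):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # z-score across the feature dimension, then a learnable scale and shift
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[2.0, 4.0, 6.0, 8.0]])
y = layer_norm(x)
# y now has (approximately) zero mean and unit variance per row
```

The `eps` term only guards against division by zero; with `gamma=1` and `beta=0` this is exactly the z-score from Chapter 2, applied per token.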
8. The Transformer Block (Putting It Together)
Think of it this way: The transformer block is a processing pipeline repeated many times (6 layers in the original paper, 96 in GPT-3, and more in many larger models). Each block takes a sequence of token representations, lets them communicate via self-attention, processes each token individually through a feed-forward network, and normalizes the results. Residual connections (skip connections) ensure that information and gradients can flow freely even through very deep stacks.
- Input Embedding — Look up each token in a learned embedding table (Ch07)
- Positional Encoding — Add position information to embeddings (this appendix, concept 5)
- Multi-Head Self-Attention — Tokens attend to each other (this appendix, concepts 2–6)
- Add & Layer Norm — Residual connection + z-score normalization (Ch02)
- Feed-Forward Network — Per-token nonlinear transformation
- Add & Layer Norm — Another residual + normalization
- Output — Logits → softmax → probabilities, trained with cross-entropy loss (Ch04)
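The pipeline above (minus the embedding lookup and final output layer) can be composed into a single function. This is a simplified sketch: single-head attention instead of multi-head, random stand-in weights instead of learned ones, a ReLU feed-forward network, and the post-norm layer ordering of the original paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def transformer_block(X, params):
    # 1. Self-attention with a residual connection, then layer norm
    X = layer_norm(X + self_attention(X, *params["attn"]))
    # 2. Per-token feed-forward (ReLU) with a residual, then layer norm
    W1, W2 = params["ffn"]
    X = layer_norm(X + np.maximum(X @ W1, 0) @ W2)
    return X

rng = np.random.default_rng(3)
d = 8
params = {
    "attn": [rng.standard_normal((d, d)) * 0.1 for _ in range(3)],
    "ffn": [rng.standard_normal((d, 4 * d)) * 0.1,
            rng.standard_normal((4 * d, d)) * 0.1],
}
X = rng.standard_normal((5, d))          # 5 tokens in...
print(transformer_block(X, params).shape)  # ...5 tokens out: (5, 8)
```

Because the block maps a sequence of token vectors to a sequence of the same shape, blocks can be stacked as deep as compute allows, which is exactly how modern models scale.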
Interactive Demos
Explore the key transformer components hands-on. Each demo connects back to formulas from earlier chapters.
Demo 1: Softmax Explorer
Enter logit values and adjust the temperature to see how softmax shapes probability distributions.
Demo 2: Attention Matrix Visualizer
Enter a sentence and see the self-attention heatmap: which tokens attend to which. Hover over cells for exact weights.
Demo 3: Transformer Block Diagram
An interactive diagram of the full transformer block. Hover over each component to see its formula and which chapter it connects to. Click to jump to that chapter.
The Full Connection Map
Every major transformer component traces back to a formula you have already learned. This table is the Rosetta Stone connecting classical text analysis to modern deep learning:
| Transformer Component | Formula Used | From Chapter |
|---|---|---|
| Training Loss | Cross-Entropy | Ch04: Information Theory |
| Model Evaluation | Perplexity | Ch04: Information Theory |
| Similarity Scoring | Dot Product / Cosine | Ch05: Similarity & Distance |
| Score → Probability | Softmax | This Appendix |
| Attention Weights | Scaled Dot-Product | This Appendix |
| Input Representations | Word Embeddings | Ch07: Word Embeddings |
| Contextual Embeddings | Self-Attention | This Appendix |
| Activation Normalization | Layer Norm (z-score) | Ch02: Co-occurrence & Association |
| What Transformers Replaced | N-gram Language Models | Ch06: Language Models |
| Word Association (implicit) | PMI Matrix | Ch01: PMI and Its Variants |
| Term Importance (analogous) | TF-IDF Weighting | Ch03: TF-IDF Family |
Summary
The transformer is not a single monolithic invention. It is a carefully composed stack of mathematical operations — every one of which has roots in the classical text analysis formulas covered throughout this manual. Softmax normalizes scores into probabilities. Dot products measure similarity. Layer normalization applies z-scores. Cross-entropy drives the training signal. Embeddings encode meaning.
Every formula in this manual is alive inside modern transformers. PMI matrices became embeddings. TF-IDF became attention. Z-scores became layer normalization. Cross-entropy became the training signal. The journey from counting word pairs to ChatGPT is shorter than you think.
What changed is scale and learning: instead of hand-designed features, transformers learn their representations end-to-end from data, stacking these operations billions of times. What stayed is the mathematics: the same formulas you can compute by hand on a napkin are running trillions of times per second in every inference call to a modern language model.