Appendix

The Road to Transformers

8 concepts · Advanced

What you'll learn

  • How every formula in this manual maps to a specific component inside modern transformers
  • The softmax function and why it converts raw scores into probabilities
  • How attention works: dot-product similarity scaled by dimension, applied to queries, keys, and values
  • Why self-attention creates contextual embeddings that surpass static Word2Vec vectors
  • How positional encoding injects word order into a position-agnostic architecture
  • Why layer normalization is the z-score from Ch02 with learnable parameters
  • How the full transformer block composes these pieces into the engine behind modern AI

Introduction

You have spent eleven chapters building a vocabulary of formulas for measuring word association, similarity, information, and language structure. This appendix reveals a surprising fact: every one of those formulas is alive inside modern transformer models. The architecture behind GPT, BERT, and their descendants is not a break from classical NLP — it is a synthesis of exactly the ideas you have already learned.

This appendix does not attempt a full implementation of the transformer. Instead, it walks through the key mathematical components, showing for each one precisely which earlier chapter it connects to. By the end, you will see that the distance from counting word pairs (Chapter 1) to ChatGPT is shorter than you might think.

We will build up piece by piece: from the softmax function that turns scores into probabilities, through the attention mechanism that is the transformer's core innovation, to the full transformer block that stacks these pieces into the deep networks powering today's language models.

1. The Softmax Function

Think of it this way: You have a list of raw scores (called logits) — maybe [2.0, 1.0, 0.1]. Softmax converts them into a probability distribution that sums to 1, while preserving the relative ranking. Higher scores get higher probabilities, and because each score is exponentiated, small gaps between logits become large gaps between probabilities.

Softmax Function
$$\text{softmax}(\textcolor{#2563eb}{z_i}) = \frac{e^{\textcolor{#2563eb}{z_i}}}{\sum_{j=1}^{K} e^{\textcolor{#e11d48}{z_j}}}$$
zi The logit (raw score) for class i
zj All logits in the vector, summed in the denominator
K Total number of classes (vocabulary size in language models)

Worked Example

Given logits [2.0, 1.0, 0.1]:

  1. Exponentiate each: \(e^{2.0} = 7.389\), \(e^{1.0} = 2.718\), \(e^{0.1} = 1.105\)
  2. Sum: \(7.389 + 2.718 + 1.105 = 11.212\)
  3. Divide each by the sum: [0.659, 0.242, 0.099]

The highest logit (2.0) gets 65.9% of the probability mass. The probabilities sum to 1.0.

Connection to prior chapters: Softmax is the bridge between raw scores and the probability distributions that Ch04: Cross-Entropy operates on. In transformers, the model outputs logits, softmax converts them to probabilities, and cross-entropy measures how far those probabilities are from the true next word. Softmax also appears inside attention (next section) to convert similarity scores into weights.
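The three steps of the worked example can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the max-subtraction trick is a standard numerical-stability measure that does not change the result:

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability;
    # this shifts every logit equally, so the ratios are unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

The output matches the worked example above, and the probabilities sum to exactly 1.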

2. Dot-Product Attention

Think of it this way: Imagine a database lookup. You have a query (what you're looking for), a set of keys (labels on the stored items), and values (the stored items themselves). Attention computes how well each key matches the query, then returns a weighted combination of the values — more weight on better matches.

Dot-Product Attention
$$\text{Attention}(\textcolor{#e11d48}{Q}, \textcolor{#2563eb}{K}, \textcolor{#059669}{V}) = \text{softmax}\!\Big(\textcolor{#e11d48}{Q}\textcolor{#2563eb}{K}^T\Big)\textcolor{#059669}{V}$$
Q Query matrix — what each token is "looking for"
K Key matrix — what each token "advertises" about itself
V Value matrix — the actual content each token contributes

Worked Example

Given 3 tokens with 2-dimensional Q, K, V:

  1. \(Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}\), \(K = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}\), \(V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}\)
  2. Compute \(QK^T\): each entry is a dot product between a query row and a key row.
    \(QK^T = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 2 & 1 & 1 \end{bmatrix}\)
  3. Apply softmax to each row to get attention weights, then multiply by V to get the output.

Connection to prior chapters: The \(QK^T\) operation is a dot product — the same similarity measure from Ch05: Cosine Similarity (without the normalization). Softmax then converts these similarity scores into weights, functioning like a learned, context-dependent version of Ch03: TF-IDF weighting — instead of static term importance, the model dynamically decides which tokens matter for each query.
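The worked example above can be checked in NumPy. This is a minimal sketch (the function names are illustrative, not from any library), using the same Q, K, V matrices as in the example:

```python
import numpy as np

def softmax(x):
    # Row-wise softmax: subtract each row's max for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Similarity scores, turned into weights, applied to the values
    return softmax(Q @ K.T) @ V

Q = np.array([[1., 0.], [0., 1.], [1., 1.]])
K = np.array([[1., 1.], [0., 1.], [1., 0.]])
V = np.array([[1., 0.], [0., 1.], [1., 1.]])

scores = Q @ K.T
print(scores)              # matches step 2: [[1,0,1],[1,1,0],[2,1,1]]
print(attention(Q, K, V))  # each row is a weighted blend of V's rows
```

Each row of the attention weights sums to 1, so every output row is a convex combination of the value vectors.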

3. Scaled Dot-Product Attention

Think of it this way: When vectors have many dimensions, their dot products tend to be large numbers. Large inputs to softmax push it into a regime where almost all the weight goes to a single token (saturation). Dividing by \(\sqrt{d_k}\) rescales the values to a sensible range, keeping the softmax gradient healthy during training.

Scaled Dot-Product Attention
$$\text{Attention}(\textcolor{#e11d48}{Q}, \textcolor{#2563eb}{K}, \textcolor{#059669}{V}) = \text{softmax}\!\left(\frac{\textcolor{#e11d48}{Q}\textcolor{#2563eb}{K}^T}{\sqrt{\textcolor{#d97706}{d_k}}}\right)\textcolor{#059669}{V}$$
√dk Square root of the key dimension. If keys are 64-dimensional, this is √64 = 8.

Why √dk specifically?

If the entries of Q and K are independent random variables with mean 0 and variance 1, then each dot product \(q \cdot k = \sum_{i=1}^{d_k} q_i k_i\) has mean 0 and variance \(d_k\). Dividing by \(\sqrt{d_k}\) brings the variance back to 1, regardless of dimension.

Connection to prior chapters: This scaling follows the same principle as the z-score from Ch02: z-score normalization: subtract the mean (here, already 0) and divide by the standard deviation (\(\sqrt{d_k}\)) to normalize values to a standard range. The goal in both cases is to prevent scale from distorting the meaningful signal.
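The variance argument above can be verified empirically. This sketch draws random unit-variance queries and keys and measures the variance of their dot products before and after scaling (sample sizes and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
# 10,000 random query/key pairs with entries drawn from N(0, 1)
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

raw = (q * k).sum(axis=1)        # raw dot products: variance grows with d_k
scaled = raw / np.sqrt(d_k)      # dividing by sqrt(d_k) restores unit variance

print(round(raw.var(), 1))       # close to 64
print(round(scaled.var(), 2))    # close to 1.0
```

Without the scaling, 64-dimensional dot products routinely reach magnitudes of ±15 or more, deep in softmax's saturated regime; after scaling they stay in a well-behaved range.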

4. Self-Attention

Think of it this way: In "The bank by the river was eroding," how does the model know "bank" means a riverbank and not a financial institution? Self-attention lets every word look at every other word in the same sentence. "Bank" attends to "river" and "eroding," gathering context to disambiguate its meaning. This creates a different representation for "bank" in every sentence it appears in.

Self-Attention: Projecting Q, K, V from the Same Input
$$\textcolor{#e11d48}{Q} = \textcolor{#7c3aed}{X}\textcolor{#e11d48}{W^Q}, \quad \textcolor{#2563eb}{K} = \textcolor{#7c3aed}{X}\textcolor{#2563eb}{W^K}, \quad \textcolor{#059669}{V} = \textcolor{#7c3aed}{X}\textcolor{#059669}{W^V}$$
X Input matrix: each row is one token's embedding (n tokens × dmodel dimensions)
WQ, WK, WV Learned projection matrices that transform the same input into three different "views"

Connection to prior chapters: Self-attention creates contextual embeddings — each word gets a different vector depending on its sentence. This fundamentally differs from the static embeddings of Ch07: Word2Vec, where "bank" always has the same vector regardless of context. The attention weight matrix also acts like a learned, dynamic co-occurrence matrix — recalling the PMI co-occurrence matrices of Ch01, but computed on the fly for each input.
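Putting the projection formulas together with scaled attention gives a complete self-attention sketch. The dimensions and random weights here are illustrative stand-ins for what a real model would learn:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # All three views are projections of the SAME input X
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

rng = np.random.default_rng(0)
n_tokens, d_model = 4, 8
X = rng.standard_normal((n_tokens, d_model))       # one row per token
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8): one contextual vector per token
```

Change any single row of X and every output row can change: that is exactly what makes these embeddings contextual rather than static.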

5. Positional Encoding

Think of it this way: Self-attention treats the input as a set, not a sequence — "dog bites man" and "man bites dog" would produce the same attention weights. Positional encoding adds a unique signal to each position so the model can distinguish word order. The sinusoidal pattern creates a kind of "binary clock" where different frequencies tick at different rates, giving each position a unique fingerprint.

Sinusoidal Positional Encoding
$$PE_{(\textcolor{#e11d48}{pos},\, \textcolor{#2563eb}{2i})} = \sin\!\left(\frac{\textcolor{#e11d48}{pos}}{10000^{\textcolor{#2563eb}{2i}/\textcolor{#059669}{d_{\text{model}}}}}\right)$$ $$PE_{(\textcolor{#e11d48}{pos},\, \textcolor{#2563eb}{2i+1})} = \cos\!\left(\frac{\textcolor{#e11d48}{pos}}{10000^{\textcolor{#2563eb}{2i}/\textcolor{#059669}{d_{\text{model}}}}}\right)$$
pos The position of the token in the sequence (0, 1, 2, ...)
i The dimension index within the embedding (0, 1, 2, ...)
dmodel The total embedding dimension (e.g., 512 in the original transformer)

How the Frequencies Work

Dimension 0 oscillates rapidly (wavelength = 2π), while the highest dimensions oscillate very slowly (wavelength approaching 2π × 10000). This means nearby positions differ mostly in the high-frequency dimensions, while distant positions differ across all dimensions. The model can learn to read these patterns to understand both absolute position and relative distance.

Connection to prior chapters: Ch06: N-gram language models capture word order inherently through the Markov assumption — each probability is conditioned on the previous \(n{-}1\) words in order. Transformers, by contrast, process all tokens in parallel and have no built-in notion of sequence. Positional encoding is the explicit mechanism that replaces the implicit ordering of n-gram models.
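The sinusoidal formulas translate directly into a small NumPy function. This sketch builds the full encoding matrix for a short sequence (the function name is illustrative):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]       # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]        # index of each sin/cos pair
    angle = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                 # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)   # (50, 512)
print(pe[0, :4])  # position 0: sin(0)=0 and cos(0)=1, alternating
```

Every row is distinct, so each position gets the unique "fingerprint" described above, and the encoding extends to any sequence length without retraining.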

6. Multi-Head Attention

Think of it this way: A single attention head can only focus on one type of relationship at a time. Multi-head attention runs several attention operations in parallel, each with its own learned projections. One head might learn syntactic dependencies (subject-verb), another might learn semantic similarity, and another might track positional patterns. The results are concatenated and projected to form a richer representation.

Multi-Head Attention
$$\text{MultiHead}(\textcolor{#e11d48}{Q}, \textcolor{#2563eb}{K}, \textcolor{#059669}{V}) = \text{Concat}(\textcolor{#d97706}{\text{head}_1}, \ldots, \textcolor{#d97706}{\text{head}_h})\,\textcolor{#7c3aed}{W^O}$$ $$\text{where } \textcolor{#d97706}{\text{head}_i} = \text{Attention}(\textcolor{#e11d48}{Q}W_i^Q,\; \textcolor{#2563eb}{K}W_i^K,\; \textcolor{#059669}{V}W_i^V)$$
headi The output of the i-th attention head, each operating on a different learned subspace
WO Output projection matrix that combines all heads back into dmodel dimensions
h Number of heads (typically 8 or 16 in standard transformers)

Connection to prior chapters: Running multiple attention heads in parallel is analogous to computing multiple association measures simultaneously — just as Ch02 showed that different measures (t-test, chi-squared, log-likelihood) capture different aspects of word relationships, each attention head captures a different type of token relationship. The model learns which aspects to combine.
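The concat-then-project formula can be sketched as follows. In standard transformers each head operates in a subspace of dimension dmodel/h; the random weights below are illustrative placeholders for learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, heads, W_o):
    # heads: list of (W_q, W_k, W_v) triples, one per head
    outputs = [attention(X @ W_q, X @ W_k, X @ W_v)
               for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=-1) @ W_o   # concat, then project

rng = np.random.default_rng(0)
n, d_model, h = 4, 16, 4
d_head = d_model // h            # each head works in a smaller subspace
X = rng.standard_normal((n, d_model))
heads = [tuple(rng.standard_normal((d_model, d_head)) for _ in range(3))
         for _ in range(h)]
W_o = rng.standard_normal((d_model, d_model))

print(multi_head_attention(X, heads, W_o).shape)  # (4, 16)
```

Because each head projects into a dmodel/h-dimensional subspace, running h heads costs roughly the same as one full-width head, while letting each head specialize.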

7. Layer Normalization

Think of it this way: As data flows through many transformer layers, the activations can drift to very large or very small values, making training unstable. Layer normalization recenters and rescales each layer's output to have zero mean and unit variance — exactly the z-score you learned in Chapter 2, plus learnable parameters that let the model undo the normalization if needed.

Layer Normalization
$$\text{LayerNorm}(\textcolor{#2563eb}{x}) = \textcolor{#d97706}{\gamma} \cdot \frac{\textcolor{#2563eb}{x} - \textcolor{#e11d48}{\mu}}{\textcolor{#059669}{\sigma}} + \textcolor{#7c3aed}{\beta}$$
x The activation vector to normalize
μ Mean of x across the feature dimension
σ Standard deviation of x across the feature dimension
γ Learnable scale parameter (initialized to 1)
β Learnable shift parameter (initialized to 0)

Connection to prior chapters: The core of this formula — \(\frac{x - \mu}{\sigma}\) — is the z-score from Ch02: z-score. In Chapter 2, z-scores normalized co-occurrence counts to identify statistically surprising word pairs. Here, the same normalization stabilizes the flow of information through dozens or hundreds of transformer layers. The learnable \(\gamma\) and \(\beta\) are the only addition — they let the network learn the optimal scale and shift for each layer.
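The formula is short enough to implement directly. This sketch adds the small epsilon that practical implementations use to avoid division by zero (a standard detail, not part of the formula above):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the feature dimension: the z-score from Ch02,
    # plus a learnable scale (gamma) and shift (beta)
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.array([2.0, 4.0, 6.0, 8.0])
gamma, beta = np.ones(4), np.zeros(4)   # initial values: a pure z-score
out = layer_norm(x, gamma, beta)
print(out.round(3))   # zero mean, unit variance
```

With gamma = 1 and beta = 0, this is exactly the z-score computation; training then nudges gamma and beta away from those defaults wherever a different scale helps.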

8. The Transformer Block (Putting It Together)

Think of it this way: The transformer block is a processing pipeline repeated many times (6 layers in the original paper, 96 in GPT-3, and still more in the largest modern models). Each block takes a sequence of token representations, lets them communicate via self-attention, processes each token individually through a feed-forward network, and normalizes the results. Residual connections (skip connections) ensure that information and gradients can flow freely even through very deep stacks.

Transformer Block (Post-Norm Variant)
$$\textcolor{#d97706}{a} = \text{LayerNorm}\!\Big(\textcolor{#7c3aed}{x} + \text{MultiHeadAttn}(\textcolor{#7c3aed}{x})\Big)$$ $$\textcolor{#059669}{\text{output}} = \text{LayerNorm}\!\Big(\textcolor{#d97706}{a} + \text{FFN}(\textcolor{#d97706}{a})\Big)$$
x Input to the block: token embeddings + positional encoding (first block) or output of the previous block
x + MultiHeadAttn(x) Residual connection: the original input is added back to the attention output, ensuring nothing is lost
FFN(a) Feed-forward network: two linear layers with a ReLU/GELU activation in between, applied to each token independently

What the full pipeline looks like:
  1. Input Embedding — Look up each token in a learned embedding table (Ch07)
  2. Positional Encoding — Add position information to embeddings (this appendix, concept 5)
  3. Multi-Head Self-Attention — Tokens attend to each other (this appendix, concepts 2–6)
  4. Add & Layer Norm — Residual connection + z-score normalization (Ch02)
  5. Feed-Forward Network — Per-token nonlinear transformation
  6. Add & Layer Norm — Another residual + normalization
  7. Output — Logits → softmax → probabilities, trained with cross-entropy loss (Ch04)
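The two post-norm equations can be composed into a runnable sketch. To stay short, this version uses a single attention head and omits the learnable γ and β (equivalent to fixing them at their initial values of 1 and 0); the 0.1 weight scaling is just to keep the toy activations small:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def ffn(x, W1, W2):
    return np.maximum(0, x @ W1) @ W2   # two linear layers, ReLU between

def transformer_block(X, attn_w, ffn_w):
    a = layer_norm(X + self_attention(X, *attn_w))  # attend + residual + norm
    return layer_norm(a + ffn(a, *ffn_w))           # FFN + residual + norm

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.standard_normal((n, d)) * 0.1
attn_w = tuple(rng.standard_normal((d, d)) * 0.1 for _ in range(3))
ffn_w = (rng.standard_normal((d, 4 * d)) * 0.1,
         rng.standard_normal((4 * d, d)) * 0.1)

out = transformer_block(X, attn_w, ffn_w)
print(out.shape)  # (4, 8): same shape in, same shape out
```

Because the block maps an (n × dmodel) input to an (n × dmodel) output, blocks can be stacked arbitrarily deep; that shape-preserving design is what makes the "repeat N times" architecture possible.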

Interactive Demos

Explore the key transformer components hands-on. Each demo connects back to formulas from earlier chapters.

Demo 1: Softmax Explorer

Enter logit values and adjust the temperature to see how softmax shapes probability distributions.

Demo 2: Attention Matrix Visualizer

Enter a sentence and see the self-attention heatmap: which tokens attend to which. Hover over cells for exact weights.

Demo 3: Transformer Block Diagram

An interactive diagram of the full transformer block. Hover over each component to see its formula and which chapter it connects to. Click to jump to that chapter.

The Full Connection Map

Every major transformer component traces back to a formula you have already learned. This table is the Rosetta Stone connecting classical text analysis to modern deep learning:

Transformer Component | Formula Used | From Chapter
Training Loss | Cross-Entropy | Ch04: Information Theory
Model Evaluation | Perplexity | Ch04: Information Theory
Similarity Scoring | Dot Product / Cosine | Ch05: Similarity & Distance
Score → Probability | Softmax | This Appendix
Attention Weights | Scaled Dot-Product | This Appendix
Input Representations | Word Embeddings | Ch07: Word Embeddings
Contextual Embeddings | Self-Attention | This Appendix
Activation Normalization | Layer Norm (z-score) | Ch02: Co-occurrence & Association
What Transformers Replaced | N-gram Language Models | Ch06: Language Models
Word Association (implicit) | PMI Matrix | Ch01: PMI and Its Variants
Term Importance (analogous) | TF-IDF Weighting | Ch03: TF-IDF Family
Summary

The transformer is not a single monolithic invention. It is a carefully composed stack of mathematical operations — every one of which has roots in the classical text analysis formulas covered throughout this manual. Softmax normalizes scores into probabilities. Dot products measure similarity. Layer normalization applies z-scores. Cross-entropy drives the training signal. Embeddings encode meaning.

The Big Picture

Every formula in this manual is alive inside modern transformers. PMI matrices became embeddings. TF-IDF became attention. Z-scores became layer normalization. Cross-entropy became the training signal. The journey from counting word pairs to ChatGPT is shorter than you think.

What Changed, What Stayed

What changed is scale and learning: instead of hand-designed features, transformers learn their representations end-to-end from data, stacking these operations billions of times. What stayed is the mathematics: the same formulas you can compute by hand on a napkin are running trillions of times per second in every inference call to a modern language model.