Introduction
In Part 1, I explained what GPT-2 does at a high level: tokenization, embeddings, attention, and the autoregressive generation loop. Now we're going to build it.
By the end of this article, you'll understand every component of GPT-2, see the math behind each operation, and have working PyTorch code you can run yourself. All the code is available on GitHub.
Here's what we're building:
| Parameter | Value |
|---|---|
| Transformer Blocks | 12 |
| Embedding Dimension | 768 |
| Attention Heads | 12 |
| Head Dimension | 64 |
| Context Length | 1024 |
| Vocabulary Size | 50,257 |
| FFN Hidden Dimension | 3072 |
| Total Parameters | ~124M |
Prerequisites: Basic PyTorch (tensors, nn.Module, forward/backward), linear algebra basics (matrix multiplication), and Python.
We'll build each piece, then assemble them into the complete model. Let's go.
Token and Positional Embeddings
Token Embeddings
Intuition: Token IDs are arbitrary integers (15496, 995, etc.). Neural networks need continuous values to compute gradients. Solution: a lookup table.
Math:
E[token_id] → vector of size n_embd
Code:
import torch.nn as nn
# Token embedding is just nn.Embedding
self.wte = nn.Embedding(vocab_size, n_embd)
# Usage:
tok_emb = self.wte(token_ids) # (B, T) -> (B, T, n_embd)
nn.Embedding creates a matrix and does row lookup. For GPT-2: 50,257 × 768 = 38.6M parameters.
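To see that the lookup really is just row indexing, here's a quick sanity check (the token IDs are arbitrary examples):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
wte = nn.Embedding(50257, 768)  # GPT-2-sized lookup table

ids = torch.tensor([[15496, 995]])  # (B=1, T=2) example token IDs
out = wte(ids)                      # (1, 2, 768)

# nn.Embedding is literally row indexing into its weight matrix:
assert torch.equal(out, wte.weight[ids])
print(out.shape)           # torch.Size([1, 2, 768])
print(wte.weight.numel())  # 38597376 parameters
```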
Positional Embeddings
Intuition: "dog bites man" ≠ "man bites dog". We need to encode position. GPT-2 uses learned (not sinusoidal) positions.
Math:
P[position] → vector of size n_embd
Code:
self.wpe = nn.Embedding(block_size, n_embd)
# Usage:
positions = torch.arange(T, device=device) # [0, 1, 2, ..., T-1]
pos_emb = self.wpe(positions) # (T,) -> (T, n_embd)
For GPT-2: 1,024 × 768 = 786K parameters.
Combining Them
def forward(self, idx):
    B, T = idx.shape
    tok_emb = self.wte(idx)  # (B, T, n_embd)
    pos_emb = self.wpe(torch.arange(T, device=idx.device))  # (T, n_embd)
    x = tok_emb + pos_emb  # (B, T, n_embd) via broadcasting
    return x
Why addition and not concatenation? It keeps the dimension at 768 (instead of doubling it), and empirically works just as well. Broadcasting handles the batch dimension automatically.
Self-Attention: The Core
This is the heart of the transformer. Take your time with this section.
The Problem Attention Solves
Consider: "The cat sat on the mat because it was soft."
When processing "it" (position 7), we need information from "mat" (position 5). Attention lets every position directly look at every other position.
Query, Key, Value
Intuition: Each token plays three roles:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"
Q and K determine the attention weight; V provides the content.
Math:
We multiply the input X by three different weight matrices to create Query, Key, and Value. Think of it as transforming the same input three different ways, like taking a photo and applying three different filters.
- Query: Q = X @ W_Q
- Key: K = X @ W_K (same process, different weights)
- Value: V = X @ W_V (same process, different weights)
Shape of each projection: (batch_size, tokens, embedding_dim) → (batch_size, tokens, head_dimension)
Result: Each token now has Q, K, and V vectors that represent different aspects of the same information.
Computing Attention Scores
Step 1: Dot product Q and K
We multiply Query and Key (transposed) to compute how much each token should pay attention to every other token. This creates a matrix of attention scores.
scores = Q @ Kᵀ
Shape: (batch, tokens, head_dim) × (batch, head_dim, tokens) → (batch, tokens, tokens)
Result: A matrix where each cell (i, j) shows how much token i pays attention to token j.
Think of it as: "How relevant is word j to word i?"
Step 2: Scale by √d_head
Why divide by the square root? This was my biggest "aha" moment. Q and K elements have variance ~1 (after proper initialization), but the dot product of two d-dimensional vectors has variance d, so the scores grow as d_head increases. Large scores saturate softmax (it becomes peaked at one value and near-zero everywhere else), which makes gradients vanish and hurts learning. Dividing by √d_head brings the variance back to ~1, keeping the scores in a range where softmax produces smoother, more balanced attention weights. For GPT-2: d_head = 64, so we divide by 8.
scores = scores / √d_head
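The variance argument can be checked empirically. Sampling many dot products of unit-variance 64-dimensional vectors, the raw scores have variance ≈ d_head, and dividing by √d_head brings it back to ≈ 1:

```python
import torch

torch.manual_seed(0)
d_head = 64
q = torch.randn(10000, d_head)  # entries with variance ~1
k = torch.randn(10000, d_head)

raw = (q * k).sum(dim=-1)       # 10,000 sample dot products
scaled = raw / d_head ** 0.5

print(raw.var().item())     # ~64, i.e. ~d_head
print(scaled.var().item())  # ~1
```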
Step 3: Apply causal mask (set future positions to -∞)
Step 4: Softmax (each row sums to 1)
Softmax converts raw scores into probabilities. It makes all values positive and ensures each row sums to 1, so we get a proper probability distribution. The highest scores become the most probable, but nothing gets completely ignored.
weights = softmax(scores, dim=-1)
Shape: (batch, tokens, tokens) - stays the same
What softmax does: Takes each row of scores and converts them to probabilities that sum to 1.0
Result: Each token now has a probability distribution showing how much attention it should pay to every other token.
Step 5: Weighted sum of values
Finally, we use the attention weights to create a weighted combination of the Value vectors. This is where the "attention" happens: tokens with higher attention weights contribute more to the final output.
output = weights @ V
Shape: (batch, tokens, tokens) × (batch, tokens, head_dim) → (batch, tokens, head_dim)
What this does: For each token, we take a weighted average of all Value vectors, where the weights come from the attention probabilities we computed.
Result: Each token now has a new representation that incorporates information from all tokens it paid attention to.
The complete formula:
Here's the full attention mechanism in one line. It combines all the steps we just walked through: compute similarity scores between Query and Key, scale them, convert to probabilities with softmax, then use those probabilities to weight the Value vectors.
Attention(Q, K, V) = softmax(Q @ Kᵀ / √d_k) @ V
Breaking it down:
• Q @ Kᵀ: Compute how similar each Query is to each Key
• / √d_k: Scale to prevent extreme values
• softmax(...): Convert to probability distribution
• @ V: Weight and combine the Value vectors
Final output: A new representation for each token that incorporates relevant information from other tokens based on attention weights.
Causal Masking
Why: Autoregressive models generate text one token at a time, left to right. During inference they can't see future tokens because those haven't been generated yet. During training we enforce the same constraint, so the model learns to predict from only what came before, matching how it will actually be used. Causal masking prevents the model from "cheating" by looking ahead.
How:
# Create mask once
mask = torch.tril(torch.ones(block_size, block_size))
# Apply mask: where mask = 0, set score to -inf
scores = scores.masked_fill(mask[:T, :T] == 0, float('-inf'))
# After softmax: e^(-inf) = 0
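Putting the mask and softmax together on a tiny 4×4 example shows both properties at once: future positions get exactly zero weight, and each row is still a valid probability distribution.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 4
scores = torch.randn(T, T)

mask = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)
print(weights)

# Row i has zeros in all columns j > i, yet each row still sums to 1:
assert weights[0, 1:].sum() == 0
assert torch.allclose(weights.sum(dim=-1), torch.ones(T))
```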
Multi-Head Attention
GPT-2 uses 12 heads, each with d_head = 64 (12 × 64 = 768). Each head learns different patterns. Outputs are concatenated, then projected.
Full Attention Code
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # Combined Q, K, V projection (more efficient than separate)
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # Output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.head_dim = config.n_embd // config.n_head
        # Causal mask (registered as buffer, not parameter)
        self.register_buffer('bias',
            torch.tril(torch.ones(config.block_size, config.block_size))
                 .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()
        # Compute Q, K, V in one matrix multiply
        qkv = self.c_attn(x)  # (B, T, 3*C)
        q, k, v = qkv.split(self.n_embd, dim=2)  # Each: (B, T, C)
        # Reshape for multi-head: (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        # Attention scores: (B, n_head, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(self.head_dim))
        # Causal mask
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        # Softmax
        att = F.softmax(att, dim=-1)
        # Apply to values: (B, n_head, T, head_dim)
        y = att @ v
        # Concatenate heads: (B, T, C)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        # Output projection
        return self.c_proj(y)
Key details:
- c_attn combines W_Q, W_K, W_V into one projection for efficiency
- .view() and .transpose() reshape for multi-head attention
- register_buffer saves the mask with the model but doesn't train it
- .contiguous() is needed before .view() after a transpose
Feed-Forward Network (MLP)
Intuition: Attention gathers information from context. The MLP processes each position independently, "thinking about" what was gathered.
Architecture
- Expand: 768 → 3072 (4× expansion)
- Nonlinearity: GELU
- Contract: 3072 → 768
Why Expand Then Contract?
More "working memory" for computation. Like scratch paper bigger than the final answer. The 4× factor is empirical; it's what worked well.
GELU vs ReLU
GELU(x) ≈ x · σ(1.702x) (a smooth sigmoid approximation; the code below uses PyTorch's closely matching tanh approximation)
GELU has smoother transitions than ReLU and empirically works slightly better for language modeling.
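Both approximations track the exact (erf-based) GELU closely, which a quick comparison confirms:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, 81)
exact = F.gelu(x)                              # exact (erf-based) GELU
sigmoid_approx = x * torch.sigmoid(1.702 * x)  # the x·σ(1.702x) approximation
tanh_approx = F.gelu(x, approximate='tanh')    # what our MLP uses

print((exact - sigmoid_approx).abs().max())  # ~0.02
print((exact - tanh_approx).abs().max())     # much smaller still
```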
Code
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        x = self.c_fc(x)    # (B, T, 768) -> (B, T, 3072)
        x = self.gelu(x)
        x = self.c_proj(x)  # (B, T, 3072) -> (B, T, 768)
        return x
Parameter Count
- c_fc: 768 × 3072 = 2.36M
- c_proj: 3072 × 768 = 2.36M
- Total per MLP: ~4.7M
- 12 blocks × 4.72M ≈ 56.7M (about 45% of the model!)
Transformer Block
Now we combine attention + MLP with residual connections and layer normalization.
Layer Normalization
Normalizes each position to mean 0, variance 1. Learned scale (γ) and shift (β). Stabilizes deep network training.
# PyTorch does this for us
self.ln_1 = nn.LayerNorm(n_embd)
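A freshly initialized LayerNorm (γ = 1, β = 0) just normalizes, which we can confirm on deliberately un-normalized input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(768)              # γ initialized to 1, β to 0
x = torch.randn(2, 5, 768) * 3 + 7  # deliberately not mean-0/var-1
y = ln(x)

# Each position's 768 features come out with mean ~0 and variance ~1:
print(y.mean(dim=-1).abs().max().item())            # ~0
print(y.var(dim=-1, unbiased=False).mean().item())  # ~1
```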
Residual Connections
x = x + sublayer(x) # Not x = sublayer(x)
Why?
- Gradient flow: Direct path backward through addition
- Easier learning: Layer only learns the difference from identity
Pre-LN Architecture (GPT-2 Style)
x = x + attention(layernorm(x)) # Pre-norm
x = x + mlp(layernorm(x))
(vs. Post-norm: x = layernorm(x + attention(x)))
Pre-LN is more stable for deep networks.
Full Block Code
class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        # Attention with residual
        x = x + self.attn(self.ln_1(x))
        # MLP with residual
        x = x + self.mlp(self.ln_2(x))
        return x
Parameter Count Per Block
| Component | Parameters |
|---|---|
| ln_1 (γ, β) | 2 × 768 = 1,536 |
| Attention | ~2.36M |
| ln_2 (γ, β) | 1,536 |
| MLP | ~4.7M |
| Total | ~7.08M |
12 blocks × 7.08M = ~85M parameters
Full Model Assembly
Architecture Overview
Embeddings (wte + wpe)
|
Block 1 -> Block 2 -> ... -> Block 12
|
Final LayerNorm (ln_f)
|
Output Projection (lm_head)
Weight Tying
self.transformer.wte.weight = self.lm_head.weight
Why? Semantic consistency: if "cat" has embedding v, producing "cat" should mean the hidden state is close to v. Also saves 38.6M parameters!
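The tying is literal sharing of one tensor, not a copy, which a small standalone example makes concrete (the 100/16 sizes here are arbitrary toy values):

```python
import torch
import torch.nn as nn

vocab_size, n_embd = 100, 16  # toy sizes for illustration
wte = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)

# nn.Linear stores its weight as (out_features, in_features) = (vocab, n_embd),
# the same shape as the embedding table, so they can share storage:
wte.weight = lm_head.weight

assert wte.weight.data_ptr() == lm_head.weight.data_ptr()
# One (vocab_size × n_embd) matrix serves both roles; any gradient update
# to one is an update to the other.
```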
Full Model Code
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            h = nn.ModuleList([TransformerBlock(config)
                               for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        # Weight tying
        self.transformer.wte.weight = self.lm_head.weight

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # Embeddings
        tok_emb = self.transformer.wte(idx)
        pos_emb = self.transformer.wpe(torch.arange(T, device=idx.device))
        x = tok_emb + pos_emb
        # Transformer blocks
        for block in self.transformer.h:
            x = block(x)
        # Final norm and projection
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        # Loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )
        return logits, loss
Parameter Count Verification
| Component | Parameters |
|---|---|
| wte | 50,257 × 768 = 38,597,376 |
| wpe | 1,024 × 768 = 786,432 |
| 12 Blocks | 12 × 7,087,872 = 85,054,464 |
| ln_f | 2 × 768 = 1,536 |
| lm_head | Tied = 0 |
| Total | 124,439,808 |
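The table's numbers can be re-derived by hand (biases included, since nn.Linear has them by default; lm_head contributes nothing because it's tied):

```python
vocab, d, block, n_layer = 50257, 768, 1024, 12

wte = vocab * d                              # token embeddings
wpe = block * d                              # positional embeddings
ln = 2 * d                                   # one LayerNorm: γ and β
attn = (d * 3 * d + 3 * d) + (d * d + d)     # c_attn + c_proj, with biases
mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)  # c_fc + c_proj, with biases
per_block = 2 * ln + attn + mlp

total = wte + wpe + n_layer * per_block + ln  # + final ln_f
print(per_block)  # 7087872
print(total)      # 124439808
```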
Training
Cross-Entropy Loss
Intuition: How wrong are our predictions?
Examples:
- P("fox") = 0.80 → loss = 0.22 (good)
- P("fox") = 0.15 → loss = 1.90 (bad)
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
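The example numbers above are just −ln(p), and F.cross_entropy reproduces them: passing log-probabilities as logits re-softmaxes to the same distribution, so the loss on the correct class comes out to −ln(0.80).

```python
import math
import torch
import torch.nn.functional as F

# If the model puts probability p on the correct next token, the loss is -ln(p):
print(-math.log(0.80))  # ~0.223
print(-math.log(0.15))  # ~1.897

logits = torch.tensor([[0.80, 0.15, 0.05]]).log()  # log-probs used as logits
loss = F.cross_entropy(logits, torch.tensor([0]))  # correct class is index 0
print(loss.item())  # ~0.223
```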
AdamW Optimizer
Why not plain SGD?
- Momentum: Smooth out noisy gradients
- Adaptive LR: Different learning rates per parameter
- Weight decay: Prevent weights from growing too large
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1
)
Learning Rate Schedule
Warmup (first ~715 steps): Start at 0, increase linearly to max_lr. Prevents early instability.
Cosine Decay (after warmup): Decrease from max_lr to min_lr (10% of max). Smooth decay helps convergence.
def get_lr(step, warmup_steps=715, max_lr=6e-4, min_lr=6e-5, max_steps=19073):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step > max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
Gradient Accumulation
Problem: A batch of 512 sequences × 1024 tokens each = ~524K tokens per optimizer step. That doesn't fit in GPU memory at once.
Solution: Accumulate gradients over multiple smaller batches.
grad_accum_steps = 32
optimizer.zero_grad()
for micro_step in range(grad_accum_steps):
    x, y = get_batch()
    logits, loss = model(x, y)
    (loss / grad_accum_steps).backward()  # Scale for averaging
optimizer.step()
Gradient Clipping
Why: Prevent exploding gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Text Generation
The Autoregressive Loop
@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None):
    for _ in range(max_new_tokens):
        # Crop to context length
        idx_cond = idx[:, -model.config.block_size:]
        # Get predictions
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature
        # Top-k filtering
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float('-inf')
        # Sample
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        # Append
        idx = torch.cat([idx, idx_next], dim=1)
    return idx
Sampling Strategies
| Strategy | Description | Use Case |
|---|---|---|
| Greedy | Always pick highest prob | Factual content |
| Temperature | Scale logits before softmax | Creativity control |
| Top-k | Only consider top k tokens | Prevent weird outliers |
| Top-p | Consider until cumsum = p | Adaptive confidence |
Temperature intuition: T < 1 = more deterministic, T = 1 = original probs, T > 1 = more random.
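A toy distribution makes the temperature effect visible: dividing the same logits by different T before softmax sharpens or flattens the resulting probabilities.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])  # arbitrary example logits
for T in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / T, dim=-1)
    print(T, probs.tolist())
# T=0.5 concentrates mass on the top token; T=2.0 pushes toward uniform.
```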
Results and Samples
Training Details
- Dataset: Shakespeare (~1MB)
- Steps: 1000
- Hardware: Single GPU
Sample Outputs
Temperature 0.7:
ROMEO:
What say'st thou? Speak again, bright angel, for thou art
As glorious to this night, being o'er my head,
As is a winged messenger of heaven
Unto the white-upturned wondering eyes
Of mortals that fall back to gaze on him...
Temperature 1.0:
ROMEO:
O, she doth teach the torches to burn bright!
It seems she hangs upon the cheek of night
Like a rich jewel in an Ethiope's ear;
Beauty too rich for use, for earth too dear!
Temperature 1.5:
ROMEO:
By my head, here come the Capulets strange matter!
O serpent heart, hid with a flowering face!
Did ever dragon keep so fair a cave?
Beautiful tyrant! fiend angelical!
What It Learned
- Iambic pentameter rhythm
- Character voices and names
- Dramatic sentence structure
- Shakespearean vocabulary
All from next-token prediction on ~1MB of text!
Summary and Resources
What We Built
- Embeddings: Token IDs → vectors
- Attention: Let positions communicate
- MLP: Process each position
- Residuals + LayerNorm: Enable depth
- Training: Cross-entropy, AdamW, cosine schedule
- Generation: Autoregressive sampling
GitHub Repository
github.com/prathyusha1231/gpt2-shakespeare
Contents:
- Complete model code (~500 lines)
- Training script
- Tests matching HuggingFace
- Documentation for each component
Resources
- Andrej Karpathy's build-nanogpt
- Attention Is All You Need (Original Transformer paper)
- GPT-2 Paper
Closing Thought
The architecture is simple. That's what makes it powerful: the same structure scales from 124M to hundreds of billions of parameters. The elegance is in the simplicity: tokenize, embed, attend, project, sample, repeat.
Missed the Intuition?
Check out Part 1 for a no-math explanation of how GPT-2 works.
Read Part 1: What GPT-2 Actually Does