Introduction
In Part 1, I explained what GPT-2 does at a high level: tokenization, embeddings, attention, and the autoregressive generation loop. Now we're going to build it.
By the end of this article, you'll understand every component of GPT-2, see the math behind each operation, and have working PyTorch code you can run yourself. All the code is available on GitHub.
Here's what we're building:
| Parameter | Value |
|---|---|
| Transformer Blocks | 12 |
| Embedding Dimension | 768 |
| Attention Heads | 12 |
| Head Dimension | 64 |
| Context Length | 1024 |
| Vocabulary Size | 50,257 |
| FFN Hidden Dimension | 3072 |
| Total Parameters | ~124M |
Prerequisites: Basic PyTorch (tensors, nn.Module, forward/backward), linear algebra basics (matrix multiplication), and Python.
We'll build each piece, then assemble them into the complete model. Let's go.
Token and Positional Embeddings
Token Embeddings
Intuition: Token IDs are arbitrary integers (15496, 995, etc.). Neural networks need continuous values to compute gradients. Solution: a lookup table.
Math:
E[token_id] → vector of size n_embd
Code:
import torch.nn as nn
# Token embedding is just nn.Embedding
self.wte = nn.Embedding(vocab_size, n_embd)
# Usage:
tok_emb = self.wte(token_ids) # (B, T) -> (B, T, n_embd)
nn.Embedding creates a matrix and does row lookup. For GPT-2: 50,257 × 768 = 38.6M parameters.
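To see that the lookup really is just row indexing, here's a quick sanity check (the token IDs are arbitrary examples):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
wte = nn.Embedding(50257, 768)  # GPT-2-sized lookup table

ids = torch.tensor([[15496, 995]])  # (B=1, T=2) example token IDs
out = wte(ids)                      # (1, 2, 768)

# nn.Embedding is literally row indexing into its weight matrix:
assert torch.equal(out, wte.weight[ids])
print(out.shape)           # torch.Size([1, 2, 768])
print(wte.weight.numel())  # 38597376 parameters
```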
Positional Embeddings
Intuition: "dog bites man" ≠ "man bites dog". We need to encode position. GPT-2 uses learned (not sinusoidal) positions.
Math:
P[position] → vector of size n_embd
Code:
self.wpe = nn.Embedding(block_size, n_embd)
# Usage:
positions = torch.arange(T, device=device) # [0, 1, 2, ..., T-1]
pos_emb = self.wpe(positions) # (T,) -> (T, n_embd)
For GPT-2: 1,024 × 768 = 786K parameters.
Combining Them
def forward(self, idx):
    B, T = idx.shape
    tok_emb = self.wte(idx)  # (B, T, n_embd)
    pos_emb = self.wpe(torch.arange(T, device=idx.device))  # (T, n_embd)
    x = tok_emb + pos_emb  # (B, T, n_embd) via broadcasting
    return x
Why addition and not concatenation? It keeps the dimension at 768 (instead of doubling it), and empirically works just as well. Broadcasting handles the batch dimension automatically.
Self-Attention: The Core
This is the heart of the transformer. Take your time with this section.
The Problem Attention Solves
Consider: "The cat sat on the mat because it was soft."
When processing "it" (position 7), we need information from "mat" (position 5). Attention lets every position directly look at every other position.
Query, Key, Value
Intuition: Each token plays three roles:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"
Q and K determine the attention weight; V provides the content.
Math:
We multiply the input X by three different weight matrices to create Query, Key, and Value. Think of it as transforming the same input three different ways, like taking a photo and applying three different filters.
- Query: Q = X @ W_Q
- Key: K = X @ W_K (same process, different weights)
- Value: V = X @ W_V (same process, different weights)
Shape of each projection: (batch_size, tokens, embedding_dim) → (batch_size, tokens, head_dimension)
Result: Each token now has Q, K, and V vectors that represent different aspects of the same information.
Computing Attention Scores
Step 1: Dot product Q and K
We multiply Query and Key (transposed) to compute how much each token should pay attention to every other token. This creates a matrix of attention scores.
scores = Q @ Kᵀ
Shape: (batch, tokens, head_dim) × (batch, head_dim, tokens) → (batch, tokens, tokens)
Result: A matrix where each cell (i, j) shows how much token i pays attention to token j.
Think of it as: "How relevant is word j to word i?"
Step 2: Scale by √d_head
Why divide by the square root? This was my biggest "aha" moment. Q and K elements have variance ~1 (after proper initialization), but the dot product of two d-dimensional vectors has variance d, so the scores grow as d_head increases. Large scores saturate softmax (it becomes peaked at one value and near-zero everywhere else), which makes gradients vanish and hurts learning. Dividing by √d_head brings the variance back to ~1, keeping the scores in a range where softmax produces smoother, more balanced attention weights. For GPT-2: d_head = 64, so we divide by 8.
scores = scores / √d_head
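The variance argument can be checked empirically. Sampling many dot products of unit-variance 64-dimensional vectors, the raw scores have variance ≈ d_head, and dividing by √d_head brings it back to ≈ 1:

```python
import torch

torch.manual_seed(0)
d_head = 64
q = torch.randn(10000, d_head)  # entries with variance ~1
k = torch.randn(10000, d_head)

raw = (q * k).sum(dim=-1)       # 10,000 sample dot products
scaled = raw / d_head ** 0.5

print(raw.var().item())     # ~64, i.e. ~d_head
print(scaled.var().item())  # ~1
```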
Step 3: Apply causal mask (set future positions to -∞)
Step 4: Softmax (each row sums to 1)
Softmax converts raw scores into probabilities. It makes all values positive and ensures each row sums to 1, so we get a proper probability distribution. The highest scores become the most probable, but nothing gets completely ignored.
weights = softmax(scores, dim=-1)
Shape: (batch, tokens, tokens) - stays the same
What softmax does: Takes each row of scores and converts them to probabilities that sum to 1.0
Result: Each token now has a probability distribution showing how much attention it should pay to every other token.
Step 5: Weighted sum of values
Finally, we use the attention weights to create a weighted combination of the Value vectors. This is where the "attention" happens: tokens with higher attention weights contribute more to the final output.
output = weights @ V
Shape: (batch, tokens, tokens) × (batch, tokens, head_dim) → (batch, tokens, head_dim)
What this does: For each token, we take a weighted average of all Value vectors, where the weights come from the attention probabilities we computed.
Result: Each token now has a new representation that incorporates information from all tokens it paid attention to.
The complete formula:
Here's the full attention mechanism in one line. It combines all the steps we just walked through: compute similarity scores between Query and Key, scale them, convert to probabilities with softmax, then use those probabilities to weight the Value vectors.
Attention(Q, K, V) = softmax(Q @ Kᵀ / √d_k) @ V
Breaking it down:
• Q @ Kᵀ: Compute how similar each Query is to each Key
• / √d_k: Scale to prevent extreme values
• softmax(...): Convert to probability distribution
• @ V: Weight and combine the Value vectors
Final output: A new representation for each token that incorporates relevant information from other tokens based on attention weights.
Causal Masking
Why: Autoregressive models generate text one token at a time, left to right. During inference they can't see future tokens because those haven't been generated yet. During training we enforce the same constraint, so the model learns to predict from only what came before, matching how it will actually be used. Causal masking prevents the model from "cheating" by looking ahead.
How:
# Create mask once
mask = torch.tril(torch.ones(block_size, block_size))
# Apply mask: where mask = 0, set score to -inf
scores = scores.masked_fill(mask[:T, :T] == 0, float('-inf'))
# After softmax: e^(-inf) = 0
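Putting the mask and softmax together on a tiny 4×4 example shows both properties at once: future positions get exactly zero weight, and each row is still a valid probability distribution.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 4
scores = torch.randn(T, T)

mask = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)
print(weights)

# Row i has zeros in all columns j > i, yet each row still sums to 1:
assert weights[0, 1:].sum() == 0
assert torch.allclose(weights.sum(dim=-1), torch.ones(T))
```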
Multi-Head Attention
GPT-2 uses 12 heads, each with d_head = 64 (12 × 64 = 768). Each head learns different patterns. Outputs are concatenated, then projected.
Full Attention Code
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # Combined Q, K, V projection (more efficient than separate)
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # Output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.head_dim = config.n_embd // config.n_head
        # Causal mask (registered as buffer, not parameter)
        self.register_buffer('bias',
            torch.tril(torch.ones(config.block_size, config.block_size))
                 .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()
        # Compute Q, K, V in one matrix multiply
        qkv = self.c_attn(x)  # (B, T, 3*C)
        q, k, v = qkv.split(self.n_embd, dim=2)  # Each: (B, T, C)
        # Reshape for multi-head: (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        # Attention scores: (B, n_head, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(self.head_dim))
        # Causal mask
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        # Softmax
        att = F.softmax(att, dim=-1)
        # Apply to values: (B, n_head, T, head_dim)
        y = att @ v
        # Concatenate heads: (B, T, C)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        # Output projection
        return self.c_proj(y)
Key details:
- c_attn combines W_Q, W_K, W_V into one projection for efficiency
- .view() and .transpose() reshape for multi-head attention
- register_buffer saves the mask with the model but doesn't train it
- .contiguous() is needed before .view() after a transpose
Feed-Forward Network (MLP)
Intuition: Attention gathers information from context. The MLP processes each position independently, "thinking about" what was gathered.
Architecture
- Expand: 768 → 3072 (4× expansion)
- Nonlinearity: GELU
- Contract: 3072 → 768
Why Expand Then Contract?
More "working memory" for computation. Like scratch paper bigger than the final answer. The 4× factor is empirical; it's what worked well.
GELU vs ReLU
GELU(x) ≈ x · σ(1.702x) (a smooth sigmoid approximation; the code below uses PyTorch's closely matching tanh approximation)
GELU has smoother transitions than ReLU and empirically works slightly better for language modeling.
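Both approximations track the exact (erf-based) GELU closely, which a quick comparison confirms:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, 81)
exact = F.gelu(x)                              # exact (erf-based) GELU
sigmoid_approx = x * torch.sigmoid(1.702 * x)  # the x·σ(1.702x) approximation
tanh_approx = F.gelu(x, approximate='tanh')    # what our MLP uses

print((exact - sigmoid_approx).abs().max())  # ~0.02
print((exact - tanh_approx).abs().max())     # much smaller still
```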
Code
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        x = self.c_fc(x)    # (B, T, 768) -> (B, T, 3072)
        x = self.gelu(x)
        x = self.c_proj(x)  # (B, T, 3072) -> (B, T, 768)
        return x
Parameter Count
- c_fc: 768 × 3072 = 2.36M
- c_proj: 3072 × 768 = 2.36M
- Total per MLP: ~4.7M
- 12 blocks × 4.72M ≈ 56.7M (about 45% of the model!)
Transformer Block
Now we combine attention + MLP with residual connections and layer normalization.
Layer Normalization
Normalizes each position to mean 0, variance 1. Learned scale (γ) and shift (β). Stabilizes deep network training.
# PyTorch does this for us
self.ln_1 = nn.LayerNorm(n_embd)
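A freshly initialized LayerNorm (γ = 1, β = 0) just normalizes, which we can confirm on deliberately un-normalized input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(768)              # γ initialized to 1, β to 0
x = torch.randn(2, 5, 768) * 3 + 7  # deliberately not mean-0/var-1
y = ln(x)

# Each position's 768 features come out with mean ~0 and variance ~1:
print(y.mean(dim=-1).abs().max().item())            # ~0
print(y.var(dim=-1, unbiased=False).mean().item())  # ~1
```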
Residual Connections
x = x + sublayer(x) # Not x = sublayer(x)
Why?
- Gradient flow: Direct path backward through addition
- Easier learning: Layer only learns the difference from identity
Pre-LN Architecture (GPT-2 Style)
x = x + attention(layernorm(x)) # Pre-norm
x = x + mlp(layernorm(x))
(vs. Post-norm: x = layernorm(x + attention(x)))
Pre-LN is more stable for deep networks.
Full Block Code
class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        # Attention with residual
        x = x + self.attn(self.ln_1(x))
        # MLP with residual
        x = x + self.mlp(self.ln_2(x))
        return x
Parameter Count Per Block
| Component | Parameters |
|---|---|
| ln_1 (γ, β) | 2 × 768 = 1,536 |
| Attention | ~2.36M |
| ln_2 (γ, β) | 1,536 |
| MLP | ~4.7M |
| Total | ~7.08M |
12 blocks × 7.08M = ~85M parameters
Full Model Assembly
Architecture Overview
Embeddings (wte + wpe)
|
Block 1 -> Block 2 -> ... -> Block 12
|
Final LayerNorm (ln_f)
|
Output Projection (lm_head)
Weight Tying
self.transformer.wte.weight = self.lm_head.weight
Why? Semantic consistency: if "cat" has embedding v, producing "cat" should mean the hidden state is close to v. Also saves 38.6M parameters!
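The tying is literal sharing of one tensor, not a copy, which a small standalone example makes concrete (the 100/16 sizes here are arbitrary toy values):

```python
import torch
import torch.nn as nn

vocab_size, n_embd = 100, 16  # toy sizes for illustration
wte = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)

# nn.Linear stores its weight as (out_features, in_features) = (vocab, n_embd),
# the same shape as the embedding table, so they can share storage:
wte.weight = lm_head.weight

assert wte.weight.data_ptr() == lm_head.weight.data_ptr()
# One (vocab_size × n_embd) matrix serves both roles; any gradient update
# to one is an update to the other.
```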
Full Model Code
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            h = nn.ModuleList([TransformerBlock(config)
                               for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        # Weight tying
        self.transformer.wte.weight = self.lm_head.weight

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # Embeddings
        tok_emb = self.transformer.wte(idx)
        pos_emb = self.transformer.wpe(torch.arange(T, device=idx.device))
        x = tok_emb + pos_emb
        # Transformer blocks
        for block in self.transformer.h:
            x = block(x)
        # Final norm and projection
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        # Loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )
        return logits, loss
Parameter Count Verification
| Component | Parameters |
|---|---|
| wte | 50,257 × 768 = 38,597,376 |
| wpe | 1,024 × 768 = 786,432 |
| 12 Blocks | 12 × 7,087,872 = 85,054,464 |
| ln_f | 2 × 768 = 1,536 |
| lm_head | Tied = 0 |
| Total | 124,439,808 |
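The table's numbers can be re-derived by hand (biases included, since nn.Linear has them by default; lm_head contributes nothing because it's tied):

```python
vocab, d, block, n_layer = 50257, 768, 1024, 12

wte = vocab * d                              # token embeddings
wpe = block * d                              # positional embeddings
ln = 2 * d                                   # one LayerNorm: γ and β
attn = (d * 3 * d + 3 * d) + (d * d + d)     # c_attn + c_proj, with biases
mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)  # c_fc + c_proj, with biases
per_block = 2 * ln + attn + mlp

total = wte + wpe + n_layer * per_block + ln  # + final ln_f
print(per_block)  # 7087872
print(total)      # 124439808
```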
Training
Cross-Entropy Loss
Intuition: How wrong are our predictions?
Examples:
- P("fox") = 0.80 → loss = 0.22 (good)
- P("fox") = 0.15 → loss = 1.90 (bad)
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
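The example numbers above are just −ln(p), and F.cross_entropy reproduces them: passing log-probabilities as logits re-softmaxes to the same distribution, so the loss on the correct class comes out to −ln(0.80).

```python
import math
import torch
import torch.nn.functional as F

# If the model puts probability p on the correct next token, the loss is -ln(p):
print(-math.log(0.80))  # ~0.223
print(-math.log(0.15))  # ~1.897

logits = torch.tensor([[0.80, 0.15, 0.05]]).log()  # log-probs used as logits
loss = F.cross_entropy(logits, torch.tensor([0]))  # correct class is index 0
print(loss.item())  # ~0.223
```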
AdamW Optimizer
Why not plain SGD?
- Momentum: Smooth out noisy gradients
- Adaptive LR: Different learning rates per parameter
- Weight decay: Prevent weights from growing too large
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1
)
Learning Rate Schedule
Warmup (first ~715 steps): Start at 0, increase linearly to max_lr. Prevents early instability.
Cosine Decay (after warmup): Decrease from max_lr to min_lr (10% of max). Smooth decay helps convergence.
def get_lr(step, warmup_steps=715, max_lr=6e-4, min_lr=6e-5, max_steps=19073):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step > max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
Gradient Accumulation
Problem: A batch of 512 sequences × 1024 tokens each = ~524K tokens per optimizer step. That doesn't fit in GPU memory at once.
Solution: Accumulate gradients over multiple smaller batches.
grad_accum_steps = 32
optimizer.zero_grad()
for micro_step in range(grad_accum_steps):
    x, y = get_batch()
    logits, loss = model(x, y)
    (loss / grad_accum_steps).backward()  # Scale for averaging
optimizer.step()
Gradient Clipping
Why: Prevent exploding gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Text Generation
The Autoregressive Loop
@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None):
    for _ in range(max_new_tokens):
        # Crop to context length
        idx_cond = idx[:, -model.config.block_size:]
        # Get predictions
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature
        # Top-k filtering
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float('-inf')
        # Sample
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        # Append
        idx = torch.cat([idx, idx_next], dim=1)
    return idx
Sampling Strategies
| Strategy | Description | Use Case |
|---|---|---|
| Greedy | Always pick highest prob | Factual content |
| Temperature | Scale logits before softmax | Creativity control |
| Top-k | Only consider top k tokens | Prevent weird outliers |
| Top-p | Consider until cumsum = p | Adaptive confidence |
Temperature intuition: T < 1 = more deterministic, T = 1 = original probs, T > 1 = more random.
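A toy distribution makes the temperature effect visible: dividing the same logits by different T before softmax sharpens or flattens the resulting probabilities.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])  # arbitrary example logits
for T in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / T, dim=-1)
    print(T, probs.tolist())
# T=0.5 concentrates mass on the top token; T=2.0 pushes toward uniform.
```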
Results and Samples
Training Details
- Dataset: Shakespeare (~1MB)
- Steps: 1000
- Hardware: Single GPU
Sample Outputs
Temperature 0.7:
ROMEO:
What say'st thou? Speak again, bright angel, for thou art
As glorious to this night, being o'er my head,
As is a winged messenger of heaven
Unto the white-upturned wondering eyes
Of mortals that fall back to gaze on him...
Temperature 1.0:
ROMEO:
O, she doth teach the torches to burn bright!
It seems she hangs upon the cheek of night
Like a rich jewel in an Ethiope's ear;
Beauty too rich for use, for earth too dear!
Temperature 1.5:
ROMEO:
By my head, here come the Capulets strange matter!
O serpent heart, hid with a flowering face!
Did ever dragon keep so fair a cave?
Beautiful tyrant! fiend angelical!
What It Learned
- Iambic pentameter rhythm
- Character voices and names
- Dramatic sentence structure
- Shakespearean vocabulary
All from next-token prediction on ~1MB of text!
Summary and Resources
What We Built
- Embeddings: Token IDs → vectors
- Attention: Let positions communicate
- MLP: Process each position
- Residuals + LayerNorm: Enable depth
- Training: Cross-entropy, AdamW, cosine schedule
- Generation: Autoregressive sampling
GitHub Repository
github.com/prathyusha1231/gpt2-shakespeare
Contents:
- Complete model code (~500 lines)
- Training script
- Tests matching HuggingFace
- Documentation for each component
Resources
- Andrej Karpathy's build-nanogpt
- Attention Is All You Need (Original Transformer paper)
- GPT-2 Paper
Closing Thought
The architecture is simple. That's what makes it powerful: the same structure scales from 124M to hundreds of billions of parameters. The elegance is in the simplicity: tokenize, embed, attend, project, sample, repeat.
Missed the Intuition?
Check out Part 1 for a no-math explanation of how GPT-2 works.
Read Part 1: What GPT-2 Actually Does