You know that moment when you're texting and your phone suggests "I'm on my way to the..." and offers you "store," "gym," or "meeting"? Ever wonder how it knows? Your phone has a tiny language model running behind the scenes, predicting what you'll type next based on patterns it learned from billions of text messages.
GPT-2 does exactly the same thing. Just... way bigger. We're talking 124 million parameters instead of a few thousand. And instead of suggesting the next word for your texts, it can write entire essays, code, poetry, or even fake news articles (which is why OpenAI initially hesitated to release it).
I spent weeks building GPT-2 from scratch, following Andrej Karpathy's excellent tutorial, and I want to share what I learned. Not the math-heavy version that assumes you have a PhD, but the version I wish someone had explained to me when I started. You can find all the code on GitHub.
Let's demystify how language models actually work.
The Core Idea: Next Word Prediction
Here's the thing that took me embarrassingly long to internalize: GPT-2 does exactly one thing. It predicts the next word. That's it. All the magic, all the seemingly intelligent responses, all the creative writing: it all comes from one simple operation repeated over and over.
When you give GPT-2 the prompt "The cat sat on the", it doesn't think "hmm, what would make sense here?" It just outputs a probability distribution over its entire vocabulary. Something like:
| Word | Probability |
|---|---|
| mat | 25% |
| floor | 15% |
| couch | 12% |
| roof | 8% |
| table | 6% |
| ... | ... |
| elephant | 0.1% |
Notice: it's not saying "the answer IS mat." It's outputting a distribution. "Mat" is likely, "floor" is pretty likely, "elephant" is technically possible but very unlikely based on all the text the model has seen during training.
So how does GPT-2 generate entire paragraphs? It's just this loop:
- Predict a probability distribution for the next word
- Sample one word from that distribution
- Append that word to the input
- Repeat
"Once upon a time" becomes "Once upon a time there" becomes "Once upon a time there was" becomes... a full story. One word at a time.
Tokenization: Computers Don't Understand Words
Here's problem number one: neural networks only understand numbers. They can't process the word "hello" directly. We need to convert text into integers somehow.
The Naive Approach: Characters
The simplest idea: assign each character a number. a=0, b=1, c=2, and so on. So "hello" becomes [7, 4, 11, 11, 14].
This works, but there's a problem. Sequences get very long. GPT-2 has a context window of 1024 tokens, the maximum number of "things" it can look at. If we use characters, that's only about 200 words of context. We're wasting capacity on individual letters.
The Other Extreme: Whole Words
What if we assign each word a number? "the"=0, "cat"=1, "sat"=2, etc. Much more efficient!
But this creates new problems:
- Vocabulary explosion: English has hundreds of thousands of words
- Unknown words break the system: What happens when someone types "TikTok" or "COVID-19"?
- Lost relationships: The model can't see that "running," "runs," and "ran" are related
The Sweet Spot: Byte Pair Encoding (BPE)
GPT-2 uses something called Byte Pair Encoding. The idea is clever: start with individual characters, then repeatedly merge the most common adjacent pairs until you have about 50,000 tokens.
The result? Common words become single tokens, while rare words get split into pieces:
- "the" is one token
- "unhappiness" might become ["un", "happiness"]
- "ChatGPT" might become ["Chat", "G", "PT"], so it can handle words it's never seen!
GPT-2's vocabulary has exactly 50,257 tokens: 256 base bytes, plus 50,000 learned merges, plus a special end-of-text token.
In BPE tokenization, spaces are often merged with the following word. So " world" (with the space) is a different token than "world". This is why you sometimes see weird spacing in tokenized text!
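Here's a minimal sketch of one BPE training step in pure Python: count every adjacent pair of symbols across a tiny corpus, then fuse the most common pair into a single new symbol. (Real GPT-2 BPE operates on raw bytes and repeats this roughly 50,000 times; this toy version does one merge on characters.)

```python
from collections import Counter

def most_common_pair(sequences):
    # Count every adjacent pair of symbols across the whole corpus.
    pairs = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(seq, pair):
    # Replace each occurrence of the pair with a single fused symbol.
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

corpus = [list(w) for w in ("the", "then", "there", "that")]
pair = most_common_pair(corpus)  # ('t', 'h') appears most often in this corpus
corpus = [merge_pair(seq, pair) for seq in corpus]
print(corpus[0])  # ['th', 'e'] -- 't' and 'h' are now one symbol
```

Repeat this enough times and common whole words like "the" become single symbols, while rare words stay split into reusable pieces.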
Embeddings: From Numbers to Meaning
So now we have token IDs. "Hello world!" might become [15496, 995, 0]. But here's the thing: these numbers are arbitrary. Token 15496 isn't "closer" to 15497 in any meaningful way. And neural networks need continuous values to compute gradients during training.
The solution: a lookup table called an embedding matrix.
Each of our 50,257 tokens gets a vector of 768 numbers. When we see token 15496, we look up row 15496 in the matrix and get back a 768-dimensional vector. These vectors are learned during training.
The magical part? After training, similar words end up with similar vectors:
- "cat" and "dog" are close together (they appear in similar contexts)
- "cat" and "democracy" are far apart
- "king" - "man" + "woman" is close to "queen" (the famous word2vec example)
This embedding layer alone accounts for 38.6 million parameters (about 31% of the entire model)! It's literally just a table lookup, but those learned vectors encode an enormous amount of semantic information about language.
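A sketch of that lookup, using a tiny 10-token, 4-dimensional table instead of GPT-2's 50,257 × 768 one. The values here are random; in the real model they're learned:

```python
import random

random.seed(0)

# The embedding matrix is just a table: one row of numbers per token ID.
# GPT-2's real table is 50,257 rows x 768 columns (~38.6M parameters);
# this toy one is 10 x 4, with random values instead of learned ones.
embedding_matrix = [[random.gauss(0, 0.02) for _ in range(4)] for _ in range(10)]

def embed(token_ids):
    # "Embedding" a token is literally indexing a row of the table.
    return [embedding_matrix[t] for t in token_ids]

vectors = embed([3, 1, 4])
print(len(vectors), len(vectors[0]))  # 3 tokens in, 3 vectors of 4 numbers out
```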
Position Matters
Consider these two sentences:
- "The dog bit the man"
- "The man bit the dog"
Same words, completely different meanings. Word order matters!
If we just looked up embeddings for each word and fed them in with nothing marking their order, both sentences would look like the same bag of vectors to the model. We need to tell the model where each word appears.
GPT-2 solves this with positional embeddings: another lookup table, but indexed by position (0, 1, 2, ...) instead of token ID. Position 0 gets one vector, position 1 gets a different vector, and so on up to position 1023.
The final input to the model is simply:
input = token_embedding + position_embedding
Now the model knows both WHAT each token is AND WHERE it appears in the sequence.
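In code, that sum is element-wise, so the same token at two different positions produces two different input vectors. Toy sizes again (GPT-2 uses 768 dimensions), and random values standing in for learned ones:

```python
import random

random.seed(0)
D = 4  # embedding dimension (768 in GPT-2 small)

# Two lookup tables: one indexed by token ID, one indexed by position.
# Random values here; in GPT-2 both tables are learned during training.
token_emb = [[random.gauss(0, 0.02) for _ in range(D)] for _ in range(10)]
pos_emb = [[random.gauss(0, 0.02) for _ in range(D)] for _ in range(1024)]

def model_input(token_ids):
    # Element-wise sum of each token's vector and its position's vector.
    return [
        [t + p for t, p in zip(token_emb[tok], pos_emb[pos])]
        for pos, tok in enumerate(token_ids)
    ]

x = model_input([5, 2, 5])  # token 5 appears at positions 0 and 2
print(x[0] != x[2])  # True: same token, different position -> different vector
```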
Attention: The Secret Sauce
This is the most important part of the transformer architecture, and the key innovation that made GPT possible.
The Problem Attention Solves
Consider this sentence: "The cat sat on the mat because it was soft."
What does "it" refer to? The mat, obviously. But how does a neural network figure that out? "It" is at position 7, "mat" is at position 5. They're two words apart.
Older models (RNNs) passed information step by step, like a game of telephone. By the time you reached "it," the information about "mat" had degraded. Attention is different: every word can directly look at every other word.
How It Works (Intuitively)
When processing "it," the model asks: "Which other words in this sentence are relevant to me?"
It computes relevance scores:
- "mat" - very relevant (0.45)
- "cat" - somewhat relevant (0.25)
- "sat" - not very relevant (0.05)
- ... and so on for every word
Then it uses these scores to create a weighted combination of information from all words. The output for "it" is mostly influenced by "mat" because that's what got the highest attention score.
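The score-then-combine computation can be sketched directly. These are made-up 2-dimensional vectors standing in for the words (real GPT-2 heads use 64 dimensions and learned query/key/value projections), with "mat"'s key deliberately built to match the query:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    d = len(query)
    # 1. Relevance score of the query against every key (scaled dot product).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # 2. Normalize the scores into weights that sum to 1.
    weights = softmax(scores)
    # 3. Output is the weighted combination of all the value vectors.
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, out

query = [1.0, 0.0]                            # "it"
keys = [[0.5, 0.5], [0.0, 1.0], [1.0, 0.0]]   # "cat", "sat", "mat"
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.5]]
weights, out = attend(query, keys, values)
print([round(w, 2) for w in weights])  # "mat" (last entry) gets the largest weight
```

Because "mat"'s key aligns best with the query, its value vector dominates the output, which is the mechanism behind "the output for 'it' is mostly influenced by 'mat'."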
Multi-Head Attention
GPT-2 doesn't just run one attention operation. It runs 12 in parallel, called "heads." Each head can learn to focus on different patterns:
- Head 1 might focus on: "What noun could this pronoun refer to?"
- Head 2 might focus on: "What's the subject of this sentence?"
- Head 3 might focus on: "What was the previous verb?"
It's like having 12 specialists analyzing the sentence from different angles, then combining their insights.
The One Rule: No Peeking
There's one crucial constraint: when predicting the next word, the model can't look at future words. Otherwise, it could just copy the answer!
This is enforced with a "causal mask," a triangular matrix that blocks attention to future positions. When processing position 3, the model can only see positions 0, 1, 2, and 3. Never position 4 or beyond.
This is why GPT generates text left-to-right, one token at a time. At each step, it can only see what came before.
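The mask itself is simple to sketch: replace every score above the diagonal with negative infinity before the softmax, so those positions receive exactly zero weight. (The uniform scores here are made up; a real model computes them with attention.)

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
# Pretend every token pair got the same raw attention score of 1.0 ...
scores = [[1.0] * n for _ in range(n)]
# ... then mask: position i may only see positions j <= i. Blocked entries
# become -infinity, so softmax assigns them exactly zero weight.
masked = [[s if j <= i else -math.inf for j, s in enumerate(row)]
          for i, row in enumerate(scores)]

weights = [softmax(row) for row in masked]
for row in weights:
    print([round(w, 2) for w in row])
# Prints a lower-triangular pattern: row i spreads its weight over columns 0..i.
```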
Putting It All Together
Let's trace through what happens when GPT-2 processes "The cat sat":
- Tokenize: "The cat sat" becomes [464, 3797, 3332]
- Look up token embeddings: Each ID becomes a 768-dimensional vector
- Add position embeddings: Position 0, 1, 2 vectors added to each
- Pass through 12 transformer blocks: Each block has attention + feedforward layers
- Final layer normalization: Stabilizes the output
- Project to vocabulary: Output 50,257 probabilities (one per possible next token)
- Sample: Pick a token based on the probabilities
- Repeat: Add the new token and go again
That's it. That's the whole thing.
124 million parameters, 12 layers of attention, and it all comes down to: tokenize, embed, attend, project, sample, repeat.
Temperature: The Creativity Dial
You might have noticed that ChatGPT sometimes gives different answers to the same question. That's not a bug. It's controlled randomness during the sampling step.
When we have our probability distribution, we have choices:
- Greedy sampling: Always pick the highest probability word. Deterministic but boring and repetitive.
- Random sampling: Sample according to the probabilities. More varied but sometimes produces nonsense.
Temperature is a knob that reshapes the probability distribution before sampling:
- Temperature < 1: Makes peaks higher, valleys lower. More deterministic, "safer" choices.
- Temperature = 1: Original probabilities.
- Temperature > 1: Flattens the distribution. More random, more "creative."
Think of it as a creativity dial:
- Low temperature (0.3): Factual, predictable, repetitive
- Medium temperature (0.7-1.0): Balanced, natural-sounding
- High temperature (1.5+): Creative, surprising, potentially chaotic
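Under the hood, temperature is a one-line transformation: divide the model's raw scores (logits) by T before turning them into probabilities. A sketch with three made-up logits:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_probs(logits, temperature):
    # The whole trick: divide the raw scores by T before the softmax.
    # T < 1 exaggerates the gaps (sharper peaks); T > 1 shrinks them (flatter).
    return softmax([l / temperature for l in logits])

logits = [2.0, 1.0, 0.5]  # made-up raw scores for "mat", "floor", "couch"

for t in (0.3, 1.0, 2.0):
    print(t, [round(p, 2) for p in sample_probs(logits, t)])
# Low T piles almost all the probability on "mat"; high T spreads it out.
```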
This is why the same prompt can give different outputs, and why you can tune the model's behavior without retraining it.
What GPT-2 Isn't Doing
Now that you understand how it works, let's clear up some misconceptions:
It's not retrieving from a database. There's no lookup table of questions and answers. The model generates everything from learned patterns in its parameters.
It's not "understanding" in the human sense. It's statistical pattern matching at a massive scale. It learned that certain words tend to follow other words, and it exploits those patterns convincingly.
It's not thinking step by step. Unless you explicitly prompt it to ("Let's think through this step by step..."), it's just predicting the most probable next token given the context.
It's not magic. At its core, it's matrix multiplications, cleverly organized. Linear algebra with nonlinearities. The architecture is almost embarrassingly simple once you understand it.
And yet... the results are remarkable. Sometimes indistinguishable from human writing. That's the power of scale, data, and the right architecture.
What's Next
If you want to see HOW to build this (the math, the code, the training loop), that's what Part 2: Building GPT-2 from Scratch in PyTorch is all about.
All the code is available on GitHub, including ~500 lines of PyTorch, beginner-friendly documentation, a training script for Shakespeare text, and tests verifying outputs match HuggingFace's implementation.
The simplicity is part of what makes transformers so powerful. The same basic architecture that runs GPT-2 with 124 million parameters also runs GPT-4 with (reportedly) over a trillion. Scale up the data, scale up the parameters, and watch the capabilities emerge.
Ready to Build It Yourself?
See the code, the math, and the implementation details in Part 2.
Read Part 2: Building GPT-2 from Scratch