Over the last eight chapters, we have covered the major components of a Transformer-based language model, from token embeddings and positional encodings through multi-head attention and feed-forward networks. In this chapter we will bring these components together into a complete architecture.
The GPT architecture arranges these components into a single repeating unit called the Transformer block. The complete model stacks N structurally identical copies of this block between input processing that converts token IDs to vectors and output processing that converts vectors to predictions.
Inside the Block
A single Transformer block contains two sub-blocks arranged in sequence. The first is the attention sub-block, where multi-head attention lets tokens exchange information across positions. The second is the FFN sub-block, where the feed-forward network transforms each token's representation independently.
Both sub-blocks share the same structural pattern. They normalize the input, apply their specific transformation, and add the result back into the residual stream. This is the Pre-Norm arrangement from the previous chapter, and it keeps the main identity path completely clean.
The attention sub-block lets each position gather information from across the full sequence, enriching its representation with context from other tokens. The FFN sub-block then processes each of these representations individually, transforming them through its expand-activate-contract pipeline. Together, one round of cross-token communication followed by one round of per-token transformation completes a single block.
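Before looking at the full class, the norm-transform-add pattern can be sanity-checked with stand-in sub-blocks. In this sketch, plain `nn.Linear` layers replace the real `MultiHeadAttention` and `FeedForward` modules; it is a shape-check only, not the actual block:

```python
import torch
import torch.nn as nn

d_model = 8
ln1, ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
attn_stub = nn.Linear(d_model, d_model)  # stand-in for multi-head attention
ffn_stub = nn.Linear(d_model, d_model)   # stand-in for the feed-forward network

x = torch.randn(2, 5, d_model)           # (batch, seq_len, d_model)
out = x + attn_stub(ln1(x))              # attention sub-block: normalize, transform, add
out = out + ffn_stub(ln2(out))           # FFN sub-block: same pattern
print(out.shape)                         # torch.Size([2, 5, 8])
```

Because each sub-block only ever adds to its input, the shape of the residual stream is preserved end to end.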
```python
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.d_model)
        self.attn = MultiHeadAttention(config)
        self.ln2 = nn.LayerNorm(config.d_model)
        self.ffn = FeedForward(config)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x
```

Stacking Blocks
A single block refines each token's representation once: one round of multi-head self-attention followed by one round of FFN. No matter how wide the block is in d_model, it performs only a single step of contextual aggregation and non-linear transformation on the input it receives. As a result, one block captures only relatively shallow interactions and lacks the depth needed to build more abstract, hierarchical understanding from raw token embeddings.
Stacking blocks solves this limitation by enabling the model to perform iterative refinement through sequential depth. Each subsequent block operates on representations already refined by every preceding block, allowing the model to compose increasingly complex features, from lower-level patterns in early blocks to more abstract relationships in later ones.
Every block in the stack shares the same internal structure, the same two sub-blocks in the same Pre-Norm arrangement. What differs is their learned weights, which determine what each block's attention focuses on and what its FFN computes.
The residual stream connects this entire stack, running as a continuous path from the first block's input to the last block's output. Each block reads from the stream, applies its attention and FFN, and adds the result back. The stream accumulates contributions from every block it has passed through.
The Full Architecture
The full architecture assembles everything covered so far into three stages: input processing converts token IDs into vectors, the block stack refines those vectors, and output processing converts the final vectors into predictions.
Input Processing
The token embedding table from Chapter 3 is a learned matrix of shape vocab_size × d_model, where each row is the embedding for one token. Retrieving an embedding is a direct index operation: use the token ID to select the corresponding row. These row vectors encode semantic properties of each token but no positional information.
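The lookup can be seen directly with a toy table; the sizes here (10 tokens, 4 dimensions) are illustrative, not GPT-2's:

```python
import torch
import torch.nn as nn

# Toy vocabulary table of shape vocab_size x d_model.
emb = nn.Embedding(10, 4)

ids = torch.tensor([3, 7, 3])  # token IDs index rows of the table
vectors = emb(ids)             # (3, 4): one row per ID
```

The same ID always selects the same row, so `vectors[0]` and `vectors[2]` are identical regardless of where token 3 appears in the sequence.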
Token embeddings carry no positional information, so in Chapter 4 we built sinusoidal positional encoding, computing a fixed vector for each position from sine and cosine waves. GPT-2 replaces the formula with learned positional embeddings (nn.Embedding(context_length, d_model)), where each row is trained directly from data. This gives the model flexibility to discover whatever positional patterns prove useful, but the table has exactly context_length rows, so the model cannot handle positions beyond the lengths it was trained on. These vectors are added element-wise to the token embeddings, and the combined representations flow into the first block.
```python
x = token_emb + pos_emb
```

The Block Stack
These representations pass through all N blocks in order, each one refining them and adding its output back to the residual stream. After the final block, the stream enters the output stage.
GPT-2 applies dropout during training, randomly zeroing 10% of activations (and rescaling the survivors): on the attention weights and on each sub-block's output before the residual addition. Many newer models skip it entirely, relying on large data volumes as a natural regularizer.
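The train/inference asymmetry is handled by PyTorch's module modes, as this small sketch shows:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(0.1)   # zeroes each activation with probability 0.1,
                         # scaling survivors by 1/0.9 to preserve the expected value
x = torch.ones(1000)

drop.train()             # training mode: dropout is active
y_train = drop(x)        # some entries are now zero

drop.eval()              # inference mode: dropout is a no-op
y_eval = drop(x)         # identical to the input
```

Calling `model.train()` or `model.eval()` on the full GPT module propagates the mode to every dropout layer inside it.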
Output Processing
The Pre-Norm arrangement leaves the residual stream unnormalized after the final block. A dedicated LayerNorm, ln_f, corrects this before the output stage. At each position, the model now has a final hidden state: a single d_model-dimensional vector summarizing what it has computed at that position.
The output projection turns that vector into logits over the vocabulary. It produces one raw score, called a logit, for every token ID, so a single hidden state becomes a vector of length vocab_size. Entry i is the logit for token i, and softmax converts those logits into probabilities.
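The step from hidden state to probabilities can be sketched with toy sizes (16 and 100 here are illustrative, not GPT-2's real dimensions):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 16, 100
lm_head = nn.Linear(d_model, vocab_size, bias=False)

h = torch.randn(d_model)               # final hidden state at one position
logits = lm_head(h)                    # one raw score per vocabulary entry
probs = torch.softmax(logits, dim=-1)  # normalize scores into probabilities
print(logits.shape)                    # torch.Size([100])
```

Entry i of `logits` scores token i, and the softmax output sums to 1 across the vocabulary.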
Weight Tying
GPT-2 uses weight tying, which means the output layer reuses the token embedding matrix instead of learning a separate output matrix. The embedding matrix is a learned vocabulary table that maps each token ID to a vector. At the input, the model uses that table to turn token IDs into vectors. At the output, it compares the final hidden state with the learned vector for each token. Reusing the same token vectors for both jobs avoids learning a second vocabulary-sized matrix.
With weight tying, the output projection is just the transpose of the embedding matrix.
In GPT-2 Small, this avoids learning a second 50,257 × 768 matrix, which would add 38,597,376 parameters.
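This saving can be checked directly; the sketch below is standalone, with layer names mirroring the model code but otherwise illustrative:

```python
import torch.nn as nn

vocab_size, d_model = 50257, 768
wte = nn.Embedding(vocab_size, d_model)               # (50257, 768) table
lm_head = nn.Linear(d_model, vocab_size, bias=False)  # weight is also (50257, 768)

untied = lm_head.weight.numel()  # cost of a separate output matrix
lm_head.weight = wte.weight      # tie: both layers now share one tensor
print(untied)                    # 38597376
```

The assignment works because `nn.Linear` stores its weight as (out_features, in_features) and computes `x @ W.T`, so sharing the (vocab_size, d_model) embedding table directly yields the transposed projection described above.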
```python
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        # token embeddings
        self.wte = nn.Embedding(config.vocab_size, config.d_model)
        # positional embeddings
        self.wpe = nn.Embedding(config.context_length, config.d_model)
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layers)])
        self.ln_f = nn.LayerNorm(config.d_model)
        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight  # weight tying

    def forward(self, idx):
        tok_emb = self.wte(idx)
        # position IDs 0..seq_len-1, created on the same device as the input
        pos_emb = self.wpe(torch.arange(idx.shape[1], device=idx.device))
        x = tok_emb + pos_emb
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        return self.lm_head(x)
```

GPT-2 Small
The model we implement in the next chapter is GPT-2 Small: 12 blocks, d_model = 768, roughly 124 million parameters.
| Parameter | Value | Note |
|---|---|---|
| d_model | 768 | Embedding dimension |
| n_heads | 12 | Attention heads per block |
| n_layers | 12 | Transformer blocks |
| d_ff | 3072 | 4 × d_model |
| vocab_size | 50,257 | BPE token vocabulary |
| context_length | 1024 | Maximum sequence length |
| dropout | 0.1 | Dropout rate |
| params | ~124M | Total trainable parameters |
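The ~124M total can be checked by tallying parameters from the table by hand. This is a rough count that assumes GPT-2's biases and LayerNorm scale/shift terms, with the tied lm_head contributing nothing extra:

```python
d_model, n_layers, d_ff = 768, 12, 3072
vocab_size, context_length = 50257, 1024

embeddings = vocab_size * d_model + context_length * d_model  # wte + wpe
attn = 4 * (d_model * d_model + d_model)      # Q, K, V, output projections + biases
ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)    # expand + contract
layernorms = 2 * (2 * d_model)                # scale and shift, two LayerNorms per block
per_block = attn + ffn + layernorms
total = embeddings + n_layers * per_block + 2 * d_model       # + final LayerNorm

print(total)  # 124439808 -- roughly 124M
```

Note that the embeddings alone account for nearly a third of the total, which is exactly why weight tying is worthwhile at this scale.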
- A Transformer block applies two sub-blocks in sequence: multi-head attention for cross-token communication, then FFN for per-token transformation
- Each sub-block uses the Pre-Norm arrangement: normalize, transform, add the residual
- Every block shares the same structure but has its own independent set of learned weights
- Stacking blocks allows the model to compose multi-step relationships, with each block building on representations shaped by all earlier blocks
- The residual stream runs continuously through the stack, accumulating each block's contribution
- The full architecture wraps the block stack with input processing (token + learned positional embeddings) and output processing (final LayerNorm, linear projection to vocabulary, softmax)
- Weight tying reuses the token embedding matrix for the output projection, using the transpose of the embedding table to produce logits
- Dropout zeros out random activations during training to prevent over-reliance on individual features, applied after attention weights and before each residual addition
The full GPT-2 architecture is now in place. In the next chapter, we assemble these components into a working model, train it end to end, and generate text with it.