The Transformer Block

How all components combine into the repeatable transformer block.

Over the last eight chapters, we have covered the major components of a Transformer-based language model, from token embeddings and positional encodings through multi-head attention and feed-forward networks. In this chapter we bring these components together into a complete architecture.

The GPT architecture arranges these components into a single repeating unit called the Transformer block. The complete model stacks N structurally identical blocks between input processing, which converts token IDs to vectors, and output processing, which converts vectors to predictions.

Token IDs → Input Processing → Transformer Block (Attn + FFN) × N → Output Processing → Probabilities

Inside the Block

A single Transformer block contains two sub-blocks arranged in sequence. The first is the attention sub-block, where multi-head attention lets tokens exchange information across positions. The second is the FFN sub-block, where the feed-forward network transforms each token's representation independently.

input → LayerNorm → Multi-Head Attn → (+) → LayerNorm → FFN → (+) → output

Both sub-blocks share the same structural pattern. They normalize the input, apply their specific transformation, and add the result back into the residual stream. This is the Pre-Norm arrangement from the previous chapter, and it keeps the main identity path completely clean.

Attention sub-block
x₁ = x + MultiHeadAttention(LayerNorm(x))
FFN sub-block
x₂ = x₁ + FFN(LayerNorm(x₁))

The attention sub-block lets each position gather information from across the full sequence, enriching its representation with context from other tokens. The FFN sub-block then processes each of these representations individually, transforming them through its expand-activate-contract pipeline. Together, one round of cross-token communication followed by one round of per-token transformation completes a single block.

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.d_model)
        self.attn = MultiHeadAttention(config)
        self.ln2 = nn.LayerNorm(config.d_model)
        self.ffn = FeedForward(config)
 
    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x
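
To see the shape-preserving behavior that makes stacking possible, here is a self-contained sketch of the block. MultiHeadAttention and FeedForward were built in earlier chapters and are not reproduced here, so this sketch stands them in with PyTorch's built-in nn.MultiheadAttention and a simple expand-activate-contract FFN; the tiny Config sizes are illustrative, not GPT-2's.

```python
import torch
import torch.nn as nn

class Config:
    d_model = 64   # illustrative sizes, far smaller than GPT-2's
    n_heads = 4

class MultiHeadAttention(nn.Module):
    """Stand-in for the module from the attention chapter,
    backed by PyTorch's built-in nn.MultiheadAttention."""
    def __init__(self, config):
        super().__init__()
        self.mha = nn.MultiheadAttention(
            config.d_model, config.n_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.mha(x, x, x)   # self-attention: x is query, key, and value
        return out

class FeedForward(nn.Module):
    """Stand-in expand-activate-contract FFN."""
    def __init__(self, config):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config.d_model, 4 * config.d_model),
            nn.GELU(),
            nn.Linear(4 * config.d_model, config.d_model),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """The same Pre-Norm block as above."""
    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.d_model)
        self.attn = MultiHeadAttention(config)
        self.ln2 = nn.LayerNorm(config.d_model)
        self.ffn = FeedForward(config)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x

x = torch.randn(2, 10, Config.d_model)  # (batch, seq_len, d_model)
y = Block(Config())(x)
print(y.shape)                          # same shape as x
```

Because input and output shapes match, the output of one block feeds directly into the next, which is exactly what stacking relies on.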

Stacking Blocks

A single block refines each token's representation once, performing one round of multi-head self-attention followed by one round of FFN. No matter how wide the block is in terms of d_model, it remains limited to a single step of contextual aggregation and non-linear transformation on the input it receives. As a result, a single block can only capture relatively shallow interactions and lacks the iterative depth needed to progressively build more abstract hierarchical understanding from raw token embeddings.

Stacking blocks solves this limitation by enabling the model to perform iterative refinement through sequential depth. Each subsequent block operates on representations already refined by every preceding block, allowing the model to compose increasingly complex features, from lower-level patterns in early blocks to more abstract relationships in later ones.

Every block in the stack shares the same internal structure: the same two sub-blocks in the same Pre-Norm arrangement. What differs is their learned weights, which determine what each block's attention focuses on and what its FFN computes.

input → Block 1 (Attn + FFN · weights₁) → (+) → Block 2 (Attn + FFN · weights₂) → (+) → ··· → Block N (Attn + FFN · weightsₙ) → (+) → output, all carried along the residual stream

The residual stream connects this entire stack, running as a continuous path from the first block's input to the last block's output. Each block reads from the stream, applies its attention and FFN, and adds the result back. The stream accumulates contributions from every block it has passed through.

The Full Architecture

The full architecture assembles everything covered so far into three stages: input processing converts token IDs into vectors, the block stack refines those vectors, and output processing converts the final vectors into predictions.

Token IDs → Token Embed + Pos Embed → Block 1 (Attn + FFN · weights₁) → ··· → Block N (Attn + FFN · weightsₙ) → ln_f (LayerNorm) → Output Proj (→ vocab size, weight tying) → logits → Softmax → next-token probs

Input Processing

The token embedding table from Chapter 3 is a learned matrix of shape vocab_size × d_model, where each row is the embedding for one token. Retrieving an embedding is a direct index operation: use the token ID to select the corresponding row. These row vectors encode semantic properties of each token but no positional information.

Token embeddings carry no positional information, so in Chapter 4 we built sinusoidal positional encoding, computing a fixed vector for each position from sine and cosine waves. GPT-2 replaces the formula with learned positional embeddings (nn.Embedding(context_length, d_model)), where each row is trained directly from data. This gives the model flexibility to discover whatever positional patterns prove useful, but the table only covers positions seen during training. These vectors are added element-wise to the token embeddings, and the combined representations flow into the first block.

x = token_emb + pos_emb
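
As a concrete sketch of this input stage, the following stands up both embedding tables at hypothetical toy sizes and runs a three-token batch through them. Note how the (seq_len, d_model) positional rows broadcast across the batch dimension, so a single addition combines the two.

```python
import torch
import torch.nn as nn

# Hypothetical toy sizes for illustration, not GPT-2's
vocab_size, context_length, d_model = 100, 16, 8

wte = nn.Embedding(vocab_size, d_model)        # token embedding table
wpe = nn.Embedding(context_length, d_model)    # learned positional embeddings

idx = torch.tensor([[5, 42, 7]])               # token IDs: (batch=1, seq_len=3)
tok_emb = wte(idx)                             # (1, 3, d_model): one row per token ID
pos_emb = wpe(torch.arange(idx.shape[1]))      # (3, d_model): one row per position
x = tok_emb + pos_emb                          # broadcasts to (1, 3, d_model)
print(x.shape)
```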

The Block Stack

These representations pass through all N blocks in order, each one refining them and adding its output back to the residual stream. After the final block, the stream enters the output stage.

Dropout

GPT-2 applies dropout during training, randomly zeroing 10% of activations at two points: after the attention weights and before each residual addition. Many newer models skip it entirely, relying on large data volumes as a natural regularizer.
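
A quick sketch of nn.Dropout's two behaviors on a toy tensor, using GPT-2's rate of 0.1:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.1)   # GPT-2's rate: zero roughly 10% of activations

x = torch.ones(1000)
drop.train()               # training mode: random zeroing is active
y = drop(x)
# Surviving activations are scaled by 1/(1-p), so the expected sum is unchanged
print((y == 0).float().mean())   # roughly 0.10

drop.eval()                # evaluation mode: dropout is a no-op
z = drop(x)
print(torch.equal(z, x))   # True
```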

Output Processing

The Pre-Norm arrangement leaves the residual stream unnormalized after the final block. A dedicated LayerNorm, ln_f, corrects this before the output stage. At each position, the model now has a final hidden state: a single d_model-dimensional vector summarizing what it has computed at that position.

The output projection turns that vector into logits over the vocabulary. It produces one raw score, called a logit, for every token ID, so a single hidden state becomes a vector of length vocab_size. Entry i is the logit for token i, and softmax converts those logits into probabilities.

logits = h × W_proj
h: (1 × d_model)  W_proj: (d_model × vocab_size)  logits: (1 × vocab_size)
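
The projection and the softmax step can be sketched at hypothetical toy sizes; the only structural facts that matter are the shapes:

```python
import torch

d_model, vocab_size = 8, 100               # toy sizes for illustration

h = torch.randn(1, d_model)                # final hidden state at one position
W_proj = torch.randn(d_model, vocab_size)  # output projection matrix

logits = h @ W_proj                        # (1, vocab_size): one score per token
probs = torch.softmax(logits, dim=-1)      # normalize scores into probabilities

print(logits.shape)        # torch.Size([1, 100])
print(probs.sum())         # probabilities sum to 1
```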

Weight Tying

GPT-2 uses weight tying, which means the output layer reuses the token embedding matrix instead of learning a separate output matrix. The embedding matrix is a learned vocabulary table that maps each token ID to a vector. At the input, the model uses that table to turn token IDs into vectors. At the output, it compares the final hidden state with the learned vector for each token. Reusing the same token vectors for both jobs avoids learning a second vocabulary-sized matrix.

With weight tying, the output projection is just the transpose of the embedding matrix.

W_proj = E^T
logits = h × E^T
E: (vocab_size × d_model)  h: (1 × d_model)  logits: (1 × vocab_size)

In GPT-2 Small, this avoids learning a second 50,257 × 768 matrix, which would add 38,597,376 parameters.
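
Tying can be sketched in a few lines at toy sizes: assign the embedding's weight tensor to a bias-free nn.Linear, as the full GPT module does. Since nn.Linear stores its weight as (out_features, in_features) = (vocab_size, d_model), the shared tensor is E itself and the layer computes h × Eᵀ.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 8   # hypothetical toy sizes

wte = nn.Embedding(vocab_size, d_model)               # E: (vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)  # output projection
lm_head.weight = wte.weight                           # tie: both share one tensor

h = torch.randn(1, d_model)
logits = lm_head(h)            # nn.Linear computes h @ weight.T = h @ E.T

# Same result computed explicitly from the embedding matrix
manual = h @ wte.weight.T
print(torch.allclose(logits, manual))   # True
print(lm_head.weight is wte.weight)     # True: one tensor, not a copy
```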

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        # token embeddings
        self.wte = nn.Embedding(config.vocab_size, config.d_model)
        # positional embeddings
        self.wpe = nn.Embedding(config.context_length, config.d_model)
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layers)])
        self.ln_f = nn.LayerNorm(config.d_model)
        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight  # weight tying
 
    def forward(self, idx):
        tok_emb = self.wte(idx)
        pos_emb = self.wpe(torch.arange(idx.shape[1], device=idx.device))
        x = tok_emb + pos_emb
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        return self.lm_head(x)

GPT-2 Small

The model we implement in the next chapter is GPT-2 Small: 12 blocks, d_model = 768, roughly 124 million parameters.
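
That count can be checked with back-of-the-envelope arithmetic. The breakdown below assumes the standard GPT-2 layout: biases on every linear layer, two LayerNorms per block (each with scale and shift), and weight tying so the output head adds no parameters of its own.

```python
d_model, n_layers, vocab_size, context_length, d_ff = 768, 12, 50_257, 1024, 3072

wte = vocab_size * d_model                   # token embeddings (reused as the output head)
wpe = context_length * d_model               # learned positional embeddings

attn = 4 * (d_model * d_model + d_model)     # Q, K, V, output projections + biases
ffn = 2 * (d_model * d_ff) + d_ff + d_model  # expand + contract layers with biases
lns = 2 * 2 * d_model                        # two LayerNorms, each with scale and shift
per_block = attn + ffn + lns

ln_f = 2 * d_model                           # final LayerNorm
total = wte + wpe + n_layers * per_block + ln_f
print(f"{total:,}")                          # 124,439,808, i.e. ~124M
```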

Parameter        Value    Note
d_model          768      Embedding dimension
n_heads          12       Attention heads per block
n_layers         12       Transformer blocks
d_ff             3072     4 × d_model
vocab_size       50,257   BPE token vocabulary
context_length   1024     Maximum sequence length
dropout          0.1      Dropout rate
params           ~124M    Total trainable parameters

Summary
  • A Transformer block applies two sub-blocks in sequence: multi-head attention for cross-token communication, then FFN for per-token transformation
  • Each sub-block uses the Pre-Norm arrangement: normalize, transform, add the residual
  • Every block shares the same structure but has its own independent set of learned weights
  • Stacking blocks allows the model to compose multi-step relationships, with each block building on representations shaped by all earlier blocks
  • The residual stream runs continuously through the stack, accumulating each block's contribution
  • The full architecture wraps the block stack with input processing (token + learned positional embeddings) and output processing (final LayerNorm, linear projection to vocabulary, softmax)
  • Weight tying reuses the token embedding matrix for the output projection, using the transpose of the embedding table to produce logits
  • Dropout zeros out random activations during training to prevent over-reliance on individual features, applied after attention weights and before each residual addition

The full GPT-2 architecture is now in place. In the next chapter, we assemble these components into a working model, train it end to end, and generate text with it.