Residuals & Normalization

How residual connections and layer normalization enable deep networks.

We now have all the core components, from embeddings and positional encoding to multi-head attention and feed-forward networks. But a single pass through one attention layer and one FFN isn't enough to capture the full complexity of language. The model needs to refine its representations through multiple rounds of processing. By stacking the same attention-then-FFN block many times in sequence, it develops increasingly abstract patterns at each level.

GPT-2 stacks this block 12 times, and GPT-3 stacks it 96 times. That kind of depth is what gives these models their power, but it also creates serious training problems. This chapter explores what goes wrong and introduces the two techniques that fix it, residual connections and layer normalization.

Why Deep Stacks Break

Each layer applies its own transformation, and stacking means those transformations compound through every subsequent one.

During backpropagation, the gradient flows backward through every layer, multiplied at each one by that layer's local derivatives. In GPT-2, that's twelve multiplications in sequence. If those multipliers average below 1.0, the gradient shrinks at every step until early layers receive almost no learning signal, a problem known as vanishing gradients. If the multipliers average above 1.0, the gradient grows instead, eventually destabilizing training until it diverges. This is called exploding gradients.

0.7 × 0.7 × ⋯ × 0.7 (12 times) = 0.7¹² ≈ 0.014
A multiplier of 0.7 leaves only 1.4% of the gradient after 12 layers
1.3 × 1.3 × ⋯ × 1.3 (12 times) = 1.3¹² ≈ 23.3
A multiplier of 1.3 amplifies the gradient 23× over 12 layers
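The arithmetic above can be checked directly:

```python
# Twelve repeated multiplications, as in a 12-layer stack.
vanishing = 0.7 ** 12
exploding = 1.3 ** 12
print(f"0.7^12 = {vanishing:.3f}")  # 0.014: only ~1.4% of the gradient survives
print(f"1.3^12 = {exploding:.1f}")  # 23.3: the gradient grows 23x
```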

The forward pass also suffers from a related problem. Information from early layers must pass through every subsequent transformation to reach the output, and each layer's transformation can distort or overwrite this information. Over enough layers, the representations that earlier layers produced are increasingly likely to be degraded or lost entirely.

Residual Connections

Both problems share the same structural cause. The layers form a sequential chain where each one receives the previous layer's output, transforms it, and passes the result as the next layer's entire input. The information an early layer encodes can only reach the final output by surviving every transformation in this chain, and any transformation that distorts something useful passes that distortion forward to every layer that follows.

The gradient travels this same chain in reverse. For an early layer to update its weights, the loss signal must flow backward through every intermediate layer, getting multiplied at each one by that layer's local derivatives. Over enough layers, these repeated multiplications might cause gradients to vanish or explode.

Residual connections solve both problems by adding each layer's output to its input instead of replacing it:

output = sublayer(x) + x

x → sublayer(x) → (+) → output, with x also bypassing the sublayer directly into the (+)

The input x bypasses the sublayer entirely while the sublayer's output gets added to it. So the sublayer only needs to learn what to add rather than producing the full output from scratch. The change the sublayer learns to produce is called the residual, which is where the name comes from.

Starting from the token embeddings, information now flows along a pathway that runs through every layer in the model. Each layer taps into this pathway and adds its contribution, and because contributions are added rather than replaced, information from earlier layers can persist all the way to the end. This pathway is called the residual stream.

x → +attn₁ → +ffn₁ → +attn₂ → ··· → +ffnₙ → out (the residual stream)
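The contrast between replacing and adding can be sketched numerically. This is an illustration, not a trained model: random linear maps stand in for the sublayers, and we measure how much of the original input's direction survives to the end of the stack.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 32, 24
# Random linear maps stand in for sublayers; the small scale keeps each
# layer's contribution modest, loosely mimicking a trained network.
layers = [rng.normal(scale=0.1 / np.sqrt(d), size=(d, d)) for _ in range(n_layers)]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x0 = rng.normal(size=d)

# Sequential chain: each layer's output *replaces* its input.
x = x0.copy()
for W in layers:
    x = W @ x
replaced = cosine(x0, x)  # near 0: the input's direction is lost

# Residual stream: each layer's output is *added* to its input.
x = x0.copy()
for W in layers:
    x = x + W @ x
preserved = cosine(x0, x)  # near 1: the input persists in the stream

print(f"replace: {replaced:+.3f}  residual: {preserved:+.3f}")
```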

Residual connections also provide a direct path for gradients in the backward direction. At each addition point, the gradient flows to both inputs of the sum. One path carries it back through the layer. The other bypasses the layer entirely, and the gradient passes through with a multiplier of exactly 1. This guarantees that every layer receives a usable learning signal from the output, regardless of what happens inside the layers themselves. These bypass paths also smooth the loss landscape (Li et al., 2018), making optimization reliable even at considerable depth.
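The backward-path argument can be demonstrated with a toy chain and manual backpropagation. This is a sketch under simplifying assumptions (tanh sublayers, random weights, loss = sum of outputs), not a real transformer, but the mechanism is the same: at every addition point the gradient splits into a branch path through the layer and an identity path with multiplier 1.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_layers = 16, 50
Ws = [rng.normal(scale=0.5 / np.sqrt(d), size=(d, d)) for _ in range(n_layers)]
x0 = rng.normal(size=d)

def input_grad_norm(use_residual):
    # Forward pass through tanh sublayers, caching pre-activations.
    h, cache = x0.copy(), []
    for W in Ws:
        z = W @ h
        cache.append(z)
        h = np.tanh(z) + (h if use_residual else 0.0)
    # Backward pass for the loss sum(h): without residuals the gradient is
    # multiplied by each layer's local derivative; with residuals the
    # identity path adds the incoming gradient back in unchanged.
    g = np.ones(d)
    for z, W in zip(reversed(cache), reversed(Ws)):
        branch = W.T @ ((1.0 - np.tanh(z) ** 2) * g)
        g = branch + (g if use_residual else 0.0)
    return float(np.linalg.norm(g))

print(f"plain chain:     {input_grad_norm(False):.2e}")  # vanishes
print(f"with residuals:  {input_grad_norm(True):.2e}")   # stays usable
```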

Layer Normalization

In a deep network, the distribution of each layer's inputs changes during training as the parameters of earlier layers update. This forces each layer to continuously readjust to new input distributions rather than learning a stable mapping. Layer normalization addresses this by standardizing each layer's inputs to a consistent scale, neutralizing the distribution shifts that earlier layers introduce.

The core idea is to strip each token's vector down to its underlying structure. Subtracting the mean across the d_model values removes any overall offset, and dividing by the standard deviation removes differences in scale.

LayerNorm(x) = γ ⊙ (x − μ) / √(σ² + ε) + β

⊙ = element-wise multiply  ε = small constant for numerical stability  γ = learned scale  β = learned shift

The small constant ε (typically 10⁻⁵) prevents division by zero when the variance is near zero. To make this concrete, consider a token whose vector has d_model = 4, with values [2, 4, 6, 8]:

Step 1: Mean

μ = (x₁ + x₂ + ... + x_d) / d

μ = (2 + 4 + 6 + 8) / 4 = 5.0
Step 2: Variance

σ² = ((x₁ − μ)² + (x₂ − μ)² + ... + (x_d − μ)²) / d

σ² = ((2−5)² + (4−5)² + (6−5)² + (8−5)²) / 4 = (9+1+1+9) / 4 = 5.0
Step 3: Normalize

x̂ = (x − μ) / √(σ² + ε)

x̂ = [(2−5), (4−5), (6−5), (8−5)] / √5.0
= [−3, −1, 1, 3] / 2.236 = [−1.342, −0.447, 0.447, 1.342]
Step 4: Scale and shift
LayerNorm(x) = γ ⊙ x̂ + β
[γ₁, γ₂, γ₃, γ₄] ⊙ [−1.342, −0.447, 0.447, 1.342] + [β₁, β₂, β₃, β₄]
γ = [γ₁, γ₂, ..., γ_d] and β = [β₁, β₂, ..., β_d] are learned vectors, one value per dimension

The learned parameters γ (scale) and β (shift), each of size d_model, let the network adjust or even undo the normalization for each dimension independently. Since different dimensions encode different features, they may need different scales. Even if γ and β learn to undo the normalization, this differs from having no normalization at all: the output range is now a deliberate, learned choice rather than uncontrolled drift inherited from earlier layers.
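The four steps above fit in a few lines of code. A minimal sketch, reproducing the worked example with γ = 1 and β = 0:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the feature dimension of a single token vector.
    mu = x.mean()
    var = x.var()  # population variance: divides by d, matching the formula
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([2.0, 4.0, 6.0, 8.0])
# With gamma = 1 and beta = 0, the output is just the normalized vector.
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.round(3))  # [-1.342 -0.447  0.447  1.342]
```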

Pre-Norm vs Post-Norm

We now have residual connections to preserve information across depth and layer normalization to stabilize magnitudes. The final step is determining how to arrange these two mechanisms around each sublayer. The specific ordering of these operations directly affects how stably the model trains.

The original Transformer architecture (Vaswani et al., 2017) applies normalization after the residual addition. This arrangement is known as Post-Norm:

# Post-Norm
x = LayerNorm(x + sublayer(x))

Most modern language models take a different approach called Pre-Norm, which applies normalization before the sublayer rather than after:

# Pre-Norm
x = x + sublayer(LayerNorm(x))

The fundamental difference between these approaches is how they affect the identity path. In Post-Norm, the combined output passes through LayerNorm before moving to the next layer. This introduces a nonlinear operation directly on the identity path, which modifies the gradient and disrupts the clean information flow that residual connections are designed to provide.

Pre-Norm avoids this issue by keeping the residual stream entirely clean. Since LayerNorm is isolated within the sublayer branch, the identity path remains an uninterrupted chain of addition operations. The input vector flows straight through without any additional transformation, ensuring that gradients can flow backward with a consistent multiplier of 1 at every step.

Post-Norm: x → sublayer → (+ x) → LayerNorm → out — the identity path passes through LayerNorm
Pre-Norm: x → LayerNorm → sublayer → (+ x) → out — the identity path is completely clean

As a result, Post-Norm models can be difficult to train at scale. They typically require careful learning rate warmup schedules and can diverge entirely in deeper networks. Pre-Norm avoids these complications and trains much more stably. This is why it quickly became the standard approach for nearly all large language models.

Because Pre-Norm applies normalization before each sublayer, the final sublayer in the very last transformer block adds its output into the residual stream unnormalized. To account for this, Pre-Norm architectures include one final layer normalization step at the end of the model to stabilize these values. In many codebases, including GPT-2, this is referred to as ln_f.
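Putting the pieces together, a Pre-Norm block and the final normalization can be sketched as follows. The `attn` and `ffn` arguments are placeholder callables standing in for the real sublayers, and γ and β are omitted for brevity; this is an illustrative sketch, not any particular codebase's implementation (only the name `ln_f` follows GPT-2).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain LayerNorm with gamma = 1 and beta = 0, for brevity.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def pre_norm_block(x, attn, ffn):
    # Pre-Norm ordering: LayerNorm sits inside each sublayer branch,
    # so the identity path is a pure chain of additions.
    x = x + attn(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

def forward(x, blocks):
    for attn, ffn in blocks:
        x = pre_norm_block(x, attn, ffn)
    # Final normalization (ln_f in GPT-2): the last sublayer's output
    # entered the stream unnormalized, so normalize once more at the end.
    return layer_norm(x)
```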

Summary
  • Deep networks can be difficult to train because sequential layers repeatedly multiply the error signal during backpropagation, which can cause vanishing or exploding gradients
  • During the forward pass, information from earlier layers can be degraded or overwritten as it passes through successive sublayers
  • Residual connections add a bypass pathway (output = sublayer(x) + x), preserving forward information and providing a direct path for gradients with a multiplier of 1
  • Layer normalization standardizes vectors to zero mean and unit variance, using learned parameters (γ and β) to ensure sublayers receive data at a consistent scale
  • Post-Norm applies normalization after the residual addition, placing a nonlinear operation on the identity path which can complicate training at greater depths
  • Pre-Norm applies normalization inside the sublayer branch, keeping the identity path clean and generally enabling more stable training for modern language models

With residual connections preserving information and layer normalization stabilizing the scale, we finally have all the pieces we need. In the next chapter, we will combine embeddings, attention, feed-forward networks, and these two stabilization mechanisms to build the repeatable Transformer block that forms the core of the architecture.