The previous chapter ended with a limitation: embeddings capture what tokens mean, but not where they appear. Since Transformers process all tokens in parallel, "Alice helped Bob" and "Bob helped Alice" produce identical embedding sets with no indication of order. We need a way to encode position.
This chapter explores one line of reasoning that leads to the original Transformer's solution. We'll start with naive approaches, see why they fail, and build toward the sinusoidal encoding that the "Attention Is All You Need" paper introduced.
1. Attempt 1: Using the Position Index Directly
The most straightforward approach is to use the position number itself (0, 1, 2, ...) and add it to each dimension of the token embedding. If "Apple" at position 3 has embedding [0.1, -0.2, 0.3], adding 3 produces [3.1, 2.8, 3.3].
This has two problems. First, position values grow without bound as sequences get longer. Second, there's a scale mismatch: token embeddings are small values centered near zero, while position indices can be arbitrarily large. For instance, at position 1000, the representation becomes [1000.1, 999.8, 1000.3] dominated by position rather than meaning. These large values can also destabilize training by causing extreme activations and gradients.
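To make this concrete, here's a minimal NumPy sketch (the 4-dimensional embedding values are made up for illustration):

```python
import numpy as np

# A made-up 4-dimensional token embedding: small values centered near zero
token_embedding = np.array([0.1, -0.2, 0.3, 0.05])

def add_raw_position(embedding, position):
    """Attempt 1: add the raw position index to every dimension."""
    return embedding + position

print(add_raw_position(token_embedding, 3))     # [3.1  2.8  3.3  3.05]   -- the token is still faintly visible
print(add_raw_position(token_embedding, 1000))  # [1000.1  999.8  1000.3  1000.05] -- position drowns out meaning
```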
To fix the unbounded growth, we could represent position using its digits. Position 42 becomes [..., 4, 2]: every value stays between 0 and 9 regardless of sequence length, with zeros padding the higher places.
This solves the unbounded growth, but not the scale mismatch. Embedding values are typically small and centered around zero, while digits range from 0 to 9. The position signal still dominates the dimensions it occupies, and training stability issues may persist, though less severely than with raw indices. Also, most dimensions end up as zeros, wasting capacity.
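A quick sketch of the digit idea, assuming we pad to a fixed number of decimal places:

```python
def position_as_digits(position, num_digits=6):
    """Represent a position by its decimal digits, padded with leading zeros."""
    return [int(d) for d in str(position).zfill(num_digits)]

print(position_as_digits(42))    # [0, 0, 0, 0, 4, 2] -- bounded between 0 and 9
print(position_as_digits(1000))  # [0, 0, 1, 0, 0, 0] -- mostly zeros, wasted capacity
```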
2. Attempt 2: Scaling to [0, 1]
Both problems in Attempt 1 stem from unbounded or mismatched scales. A natural fix is to normalize by dividing the position index by the sequence length, keeping every value strictly between 0 and 1.
Position = index / sequence_length
This fixes the scale issues but introduces new problems.
First, the spacing between adjacent positions now depends on sequence length. In a 10-token sequence, neighbors are 0.1 apart; in a 50-token sequence, only 0.02 apart. The model has no way to learn a consistent notion of "adjacent" when the distance keeps changing.
Second, the same position index maps to different values in different sequences. Index 1 becomes 0.33 in a 3-token sequence but 0.17 in a 6-token sequence. The model cannot associate a stable meaning with any particular position.
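Both issues are easy to see in a small sketch of this normalization scheme:

```python
def normalized_position(index, seq_len):
    """Attempt 2: scale the position index into [0, 1)."""
    return index / seq_len

# Spacing between neighbors depends on sequence length
print(normalized_position(1, 10) - normalized_position(0, 10))  # 0.1
print(normalized_position(1, 50) - normalized_position(0, 50))  # 0.02

# The same index maps to different values in different sequences
print(normalized_position(1, 3))  # 0.333...
print(normalized_position(1, 6))  # 0.166...
```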
3. What Might Make a Good Positional Encoding?
None of the approaches so far quite work. Raw indices grow without bound and create scale mismatches. Digit representations stay bounded but still don't match embedding scales. Normalization fixes scale but makes spacing and position values inconsistent across sequences. Before trying again, let's step back and think about what properties we might want in a positional encoding. These aren't hard requirements, just reasonable hypotheses about what could help.
1. Scale Compatibility
The position signal should live in a reasonable numeric range, comparable to typical embedding values, so it doesn't overwhelm the token's meaning. We want the model to represent "Apple at position 5", not a vector dominated by the position.
2. Unique and Deterministic
Every position within the context window should produce a distinct encoding, and that encoding should be identical every time. Position 5 must always map to the exact same vector, whether during training or inference. If two positions share an encoding, the model has no way to tell them apart.
3. Consistent Relative Structure
Ideally, relative offsets should behave consistently regardless of absolute position. "Three tokens ahead" should look the same whether you're near the start or deep into a sequence. This consistency could help the model learn portable patterns for attending to nearby or distant tokens.
4. Extendability
It would be convenient if the encoding could extend to positions beyond what the model saw during training. In practice, this doesn't guarantee better generalization, but an encoding that doesn't break at unseen positions is at least preferable to one that does.
5. Smooth and Continuous
Positions are discrete, but it helps if the positional encoding changes gradually from one position to the next. Neural networks learn through small weight adjustments, and this works better when small input changes lead to small output changes. An encoding where adjacent positions have similar representations gives the model more structure to work with.
With these properties in mind, the question becomes: how can we represent many positions using small, bounded values while keeping the signal smooth and consistent?
4. Replacing Digits with Oscillations
Number systems offer a useful insight. They use multiple columns cycling at different speeds, allowing us to represent arbitrarily large numbers with bounded symbols. In decimal, the ones column cycles 0→9, the tens column changes every 10 counts, the hundreds every 100. This multi-rate structure gives us bounded, unique values for each position.
Decimal digits (0-9) have scale issues as we saw in Attempt 1. Binary might seem like a better fit since its values (0 and 1) are closer to typical embedding scales, while still providing the same multi-frequency structure.
But both decimal and binary are discrete. Values jump between states rather than changing gradually, as shown on the left side of the visual below. In Section 3, we hypothesized that smoothness could help since neural networks learn better when small input changes lead to small output changes. What if we kept the multi-frequency structure but replaced discrete values with smooth oscillations?
This gives us the best of both: multiple dimensions changing at different speeds for bounded, unique encodings, and gradual changes between adjacent positions rather than abrupt jumps.
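To see the multi-rate structure in code, here's a small sketch that reads off the low-order binary columns of each position; each column flips half as often as the one to its right:

```python
def binary_columns(position, num_bits=4):
    """Read off the low-order bits of a position, slowest-changing bit first."""
    return [(position >> b) & 1 for b in reversed(range(num_bits))]

for pos in range(8):
    print(pos, binary_columns(pos))
# 0 [0, 0, 0, 0]
# 1 [0, 0, 0, 1]
# 2 [0, 0, 1, 0]
# 3 [0, 0, 1, 1]
# ...the last column flips every step, the next every 2 steps, then every 4, every 8
```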
5. Finding the Function
If you have taken trigonometry, you may already see the natural candidate: the sine wave models exactly this kind of smooth, bounded oscillation. The visualization above uses a shifted version, (sin(x) + 1) / 2, to keep values between 0 and 1.
In practice, the normal sine with its centered range [-1, +1] is usually preferred. Embedding values are typically initialized around 0, and techniques like Layer Normalization (covered later) keep them there throughout the network. A zero-centered positional signal blends naturally with this distribution, whereas a [0, 1] shift would add an unnecessary constant positive bias to every dimension.
Applying this to our multi-frequency idea, each dimension gets its own sine wave cycling at a different frequency. Play the animation to see how each dimension evolves across positions:
Fast waves change quickly, helping separate nearby positions. Slow waves change gradually, helping distinguish distant positions. Together, they create a unique fingerprint for every position within the context window.
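As a toy illustration (two hand-picked frequencies, not the final formula), here's how a fast and a slow sine wave respond to nearby versus distant positions:

```python
import numpy as np

fast, slow = 1.0, 0.01  # two hand-picked example frequencies

def toy_encoding(pos):
    return np.array([np.sin(fast * pos), np.sin(slow * pos)])

print(toy_encoding(3), toy_encoding(4))      # fast component differs a lot between neighbors,
                                             # slow component barely moves (0.030 vs 0.040)
print(toy_encoding(300), toy_encoding(400))  # slow component now clearly separates the two regions
                                             # (0.141 vs -0.757) while the fast one has wrapped many times
```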
6. Making Relative Positions Easy
We now have a working positional signal using sine waves at different frequencies, giving each position a unique fingerprint. However, the original Transformer paper actually pairs every sine with a cosine at the same frequency. Why might that be?
One motivation involves how positions relate to each other. Language understanding often depends on relative positions: "the token immediately before" or "three tokens ahead." If the encoding makes these relationships easy to compute, it might help the model learn them. Ideally, going from position pos to position pos + k would follow a simple, predictable pattern.
What if we only stored sine? Suppose our encoding is just sin(ω · pos) for a given frequency. What happens when we shift to position pos + k? From high school trigonometry, we can expand sin(ω(pos + k)) using the angle addition formula:
sin(ω · (pos + k)) = sin(ω · pos) · cos(ω · k) + cos(ω · pos) · sin(ω · k)
Look at what's needed on the right side: sin(ω · pos), which we have, and cos(ω · pos), which we don't. The model could learn to approximate the missing cosine internally, but we can make its job easier by storing both values from the start.
Pairing sine with cosine for each frequency ω, we store both sin(ω · pos) and cos(ω · pos) as a 2D pair:
With cosine now available, here's how the shift looks for both components:
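sin(ω · (pos + k)) = sin(ω · pos) · cos(ω · k) + cos(ω · pos) · sin(ω · k)
cos(ω · (pos + k)) = cos(ω · pos) · cos(ω · k) - sin(ω · pos) · sin(ω · k)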
Notice something important: the right side only uses sin(ω·pos) and cos(ω·pos), which we already have stored, multiplied by constants that depend only on the offset k. We can write the shifted encoding as a matrix multiplication:
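[ sin(ω · (pos + k)) ]   [  cos(ω · k)   sin(ω · k) ]   [ sin(ω · pos) ]
[ cos(ω · (pos + k)) ] = [ -sin(ω · k)   cos(ω · k) ] · [ cos(ω · pos) ]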
The rotation matrix: The 2×2 matrix in the middle is a rotation matrix, the standard matrix that rotates any 2D point by a fixed angle. In our case, that angle is ω · k. Geometrically, the (sin, cos) pair traces out a circle as position increases, and shifting by k positions corresponds to rotating this point around the circle by angle ω · k:
Since the rotation matrix depends only on k (the offset), "k steps ahead" applies the same linear transformation regardless of starting position. Transformers are built on matrix multiplication, so expressing position shifts this way could help the model learn consistent relative position patterns.
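A quick NumPy check, with arbitrary example values for ω, pos, and k, confirms that the same offset-dependent matrix works from any starting position:

```python
import numpy as np

omega, k = 0.3, 5  # arbitrary example frequency and offset

def pair(pos):
    """The (sin, cos) pair for one frequency at a given position."""
    return np.array([np.sin(omega * pos), np.cos(omega * pos)])

# Rotation matrix that depends only on the offset k, not on the starting position
R = np.array([[ np.cos(omega * k), np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])

for pos in (0, 7, 123):
    assert np.allclose(R @ pair(pos), pair(pos + k))  # same matrix, any start
print("k steps ahead is the same linear map everywhere")
```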
7. Building the Formula
We now have all the design choices in place. In this section, we derive the final positional encoding formula. First, we need to decide how to combine the position vector with the token embedding (size d_model). There are two main approaches:
We could concatenate the position vector to the embedding, but this increases the input dimension, which means larger weight matrices and more parameters throughout the network. Addition avoids this by adding the vectors element-wise, preserving the original d_model dimensions. Although this mixes position and meaning into one vector, the network can learn to disentangle these signals during training, making addition the standard choice (Vaswani et al., 2017).
Choosing addition means our positional encoding must produce exactly d_model numbers per position. Each sine/cosine pair occupies 2 dimensions, so we can fit d_model / 2 pairs total. With d_model = 512, that gives us 256 pairs, indexed from 0 to 255. A simple arrangement is interleaving: the i-th pair puts its sine at dimension 2i and its cosine at dimension 2i + 1. For example, pair 0 uses dimensions 0 and 1, pair 1 uses dimensions 2 and 3, all the way up to pair 255 using dimensions 510 and 511.
Now we need to choose the actual frequencies. We need a fastest frequency to distinguish neighboring positions (like the ones digit in decimal distinguishes consecutive numbers), and a slowest frequency that doesn't repeat within our context window.
For the fastest frequency, ω = 1 cycles quickly, completing a full wave every ~6 positions (wavelength = 2π/ω = 2π/1 ≈ 6.28), giving fine-grained resolution near each position. For the slowest frequency, we want its wavelength to span the entire context window. Just as two decimal digits can only count to 99 before wrapping, our slowest wave should complete less than one full cycle within our target sequence length. Let's call this wavelength parameter base.
With our endpoints set (ω = 1 at the fast end, ω = 1/base at the slow end), we need to fill the frequencies in between. Number systems do this naturally: in binary, each column cycles 2× slower than the previous; in decimal, 10× slower. We adopt the same approach, computing the ratio that steps us smoothly from 1 down to 1/base across our d_model / 2 frequency pairs. This gives us evenly spaced coverage across all scales.
The formula ωᵢ = 1/base^(2i/d_model) achieves exactly what we need. The exponent 2i/d_model slides from 0 at the first pair (giving ω = 1) toward 1 at the last pair (giving ω ≈ 1/base), placing all intermediate frequencies on a smooth geometric curve. The original Transformer uses base = 10000, which yields a slowest wavelength of about 63,000 positions, well beyond typical context windows at the time. Any base large enough to cover your target sequence length works. The value 10000 was simply a safe default. Putting it all together:
PEᵢ(pos) = [ sin(pos / 10000^(2i/d_model)), cos(pos / 10000^(2i/d_model)) ]
where i ∈ [0, d_model/2) and the i-th pair fills dimensions 2i and 2i+1
That's the complete formula. Every piece traces back to a design choice we made along the way: sine/cosine pairs for rotation-friendly relative positions, multiple frequencies for scale separation, and addition to preserve the architecture's dimensionality.
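Here's a compact NumPy sketch of the whole scheme, using the interleaved layout described above (the function name and broadcasting details are our own; this is an illustration, not the paper's reference code):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model, base=10000.0):
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]        # (seq_len, 1)
    pair_idx = np.arange(d_model // 2)[np.newaxis, :]    # (1, d_model/2)
    omega = 1.0 / base ** (2 * pair_idx / d_model)       # one frequency per pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions * omega)              # pair i -> dimension 2i
    pe[:, 1::2] = np.cos(positions * omega)              # pair i -> dimension 2i + 1
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
print(pe.shape)             # (50, 512)
print(pe.min(), pe.max())   # bounded in [-1, 1]

# Combined with token embeddings by simple element-wise addition:
# input_to_transformer = token_embeddings + pe
```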
- Raw position indices grow without bound, overwhelming the semantic signal and causing training instability
- Multiple dimensions cycling at different speeds allow large positions to be encoded with small, bounded values, similar to how decimal uses ones, tens, and hundreds columns
- Sinusoidal waves change gradually between positions while staying bounded between -1 and +1
- Sine/cosine pairs enable position shifts to be expressed as linear transformations (rotation matrices), which aligns with the Transformer's matrix-based computations
- Frequency spread from fast (1) to slow (1/10000) gives each position a unique fingerprint across multiple scales
- Addition combines position and meaning while preserving d_model dimensionality, avoiding extra parameters
This isn't the only line of reasoning to arrive at positional encoding, nor the only solution. With this in place, we have the complete input representation. In the next chapter, we'll see how the Transformer uses these vectors to let tokens attend to each other.