In the previous chapter, we established that raw bytes create sequences that are too long, while word-level tokenization creates vocabularies that are too large. The solution is subword tokenization: chunking text into efficient pieces that balance sequence length against vocabulary size.
The industry-standard algorithm for this is Byte Pair Encoding (BPE). In this chapter, we'll understand how it works and build it from scratch.
1. The Key Insight
BPE is a compression algorithm. It finds patterns that appear frequently in the training data and replaces them with shorter symbols.
Think about how you text. You don't type "Laughing Out Loud" every time. That's 17 characters. Instead, you type "LOL" (3 characters). You've mentally agreed that this sequence appears so frequently it deserves to be a single unit.
BPE does exactly this, but automatically. It reads through massive amounts of text (like Wikipedia) and asks: "Which sequences of characters appear together most often?"
By the end of this process, common words like "apple" become single tokens, while rare words are split into multiple smaller tokens. This allows the model to process text more efficiently.
2. BPE Walkthrough
To understand BPE, we need to run the algorithm by hand. We'll use a short sequence to see the mechanics clearly before scaling up to real text.
The Dataset
We'll train BPE on the 11-character string aaabdaaabac.
Starting Point
Before any merges, each character is its own token. Our sequence is 11 tokens long.
Step 1: Count the Pairs
BPE scans every adjacent pair of tokens and counts how often each appears: (a, a) occurs 4 times, (a, b) twice, and (b, d), (d, a), (b, a), and (a, c) once each.
Merge
The pair (a, a) appears most often (4 times), so we merge it into a new token Z. Every aa in the original sequence becomes Z: aaabdaaabac becomes ZabdZabac.
Length: 9 tokens (down from 11)
This is compression. The information is preserved, but the sequence is shorter. This is the core of BPE.
Repeat
We count pairs again on the updated sequence. Both (Z, a) and (a, b) now appear twice. When pairs tie in frequency, the algorithm picks one arbitrarily. Here we choose (a, b) and merge it into a new token Y: ZabdZabac becomes ZYdZYac.
Length: 7 tokens (down from 9)
Result
Our toy sequence shrank from 11 tokens to 7 in just two merges, with no information lost. On large text corpora, BPE typically achieves 50-60% compression compared to raw bytes.
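If you'd like to sanity-check the walkthrough, here is a small standalone sketch (character-level rather than byte-level, purely for readability) that reproduces the two merges:

from collections import Counter

def merge_pair(tokens, pair, new_symbol):
    """Replace every occurrence of the adjacent pair with new_symbol."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_symbol)
            i += 2                                   # consumed both tokens
        else:
            out.append(tokens[i])
            i += 1
    return out

sequence = list("aaabdaaabac")                       # 11 tokens, one per character
print(Counter(zip(sequence, sequence[1:])))          # (a, a) is most frequent: 4

sequence = merge_pair(sequence, ("a", "a"), "Z")
print("".join(sequence), len(sequence))              # ZabdZabac 9

print(Counter(zip(sequence, sequence[1:])))          # (Z, a) and (a, b) tie at 2
sequence = merge_pair(sequence, ("a", "b"), "Y")     # follow the walkthrough's choice
print("".join(sequence), len(sequence))              # ZYdZYac 7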
3. Implementing BPE
Now we translate the walkthrough into code.
In the walkthrough, we used letters like a, b, c for clarity. But recall from Chapter 1: text is already a sequence of integers. When you encode "hello" as UTF-8, you get [104, 101, 108, 108, 111]. These byte values (0-255) form our base vocabulary of 256 tokens. Merged tokens get new IDs starting at 256, then 257, 258, and so on.
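You can check this directly in Python:

ids = list("hello".encode("utf-8"))
print(ids)        # [104, 101, 108, 108, 111]
print(len(ids))   # 5 base tokens, all in the 0-255 byte range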
Breaking Down the Algorithm
In the walkthrough, we repeated three operations: count pairs, pick the most frequent, and merge it. We kept going until we decided to stop (after 2 merges). In practice, we continue until we reach a target vocabulary size. To implement the algorithm, we need three functions:
- get_stats: counts how often each adjacent pair appears
- merge: replaces all occurrences of a pair with a new token
- train_bpe: the main loop that counts pairs, picks the most frequent, merges it, and repeats until the target vocab size is reached
Here's each function with its signature. You'll implement them in the challenges.
Counting Pairs
get_stats scans a list of token IDs and counts how often each adjacent pair occurs. For example, [1, 2, 3, 1, 2] returns {(1, 2): 2, (2, 3): 1, (3, 1): 1}.
def get_stats(ids: list[int]) -> dict[tuple, int]:
    """
    Given a list of integers, return a dictionary of counts of consecutive pairs.
    Example: [1, 2, 3, 1, 2] -> {(1, 2): 2, (2, 3): 1, (3, 1): 1}
    """
    counts = {}
    # Loop through the list and look at each element and its neighbor
    # Your implementation here...
    return counts

Try implementing this yourself! Switch to the Challenge tab to test your solution.
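If you want to compare notes afterwards, one possible implementation (a sketch, not the only valid answer) looks like this:

def get_stats(ids: list[int]) -> dict[tuple, int]:
    counts = {}
    for pair in zip(ids, ids[1:]):            # each element with its right neighbor
        counts[pair] = counts.get(pair, 0) + 1
    return counts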
Merging a Pair
merge takes the token list and replaces every occurrence of a target pair with a new token ID.
def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """
    In the list of integers (ids), replace all consecutive occurrences
    of pair with the new token new_id.
    """
    newids = []
    i = 0
    while i < len(ids):
        # Check if we found the pair (and aren't at the end)
        # If match: append new_id and skip ahead by 2 (we consumed both elements)
        # If no match: append current element and move ahead by 1
        pass
    return newids

This one requires careful index management. Switch to the Challenge tab to try it!
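Once you've attempted it, here is one way the index management can work out (a sketch under the same signature):

def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(new_id)
            i += 2                            # consumed both elements of the pair
        else:
            newids.append(ids[i])
            i += 1
    return newids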
The Training Loop
train_bpe combines everything into a training loop. We start with UTF-8 bytes (vocab size 256) and merge until we reach a target size. Large models use 50,000 to 100,000 tokens, balancing efficiency (shorter sequences) vs. memory (larger embedding table).
def train_bpe(text: str, num_merges: int) -> tuple[list[int], dict]:
    ids = list(text.encode("utf-8"))  # Start with raw bytes
    merges = {}  # Will store our learned merge rules
    for i in range(num_merges):
        # Find the most frequent pair, merge it, record the rule
        # New token IDs start at 256 (after the 0-255 byte range)
        pass
    return ids, merges

Put it all together! Switch to the Challenge tab to implement the complete training function.
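After you've tried the challenge, one possible version of the loop (built on get_stats and merge above) is:

def train_bpe(text: str, num_merges: int) -> tuple[list[int], dict]:
    ids = list(text.encode("utf-8"))          # Start with raw bytes
    merges = {}                               # Learned merge rules, in order
    for i in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break                             # nothing left to merge
        pair = max(stats, key=stats.get)      # most frequent adjacent pair
        new_id = 256 + i                      # new IDs start after the 0-255 byte range
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges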
After training, merges contains all the rules needed to tokenize new text, and ids contains your compressed training data.
4. Regex Pre-splitting
BPE merges pairs across the entire text without understanding word boundaries. This means punctuation can merge with words: "dog." might become one token, while " dog" (with a leading space) becomes another. The same word gets tokenized differently depending on what surrounds it:
The word "dog" has no consistent representation. Its tokenization depends on surrounding characters.
The Fix
The fix is to split text into chunks before BPE runs, forcing boundaries between different character types (words, punctuation, numbers, whitespace). Modern tokenizers use regular expressions for this. After pre-splitting, "dog" and "." land in separate chunks, so they can never merge into a single token.
By forcing boundaries before BPE runs, words, punctuation, and numbers stay in separate chunks. BPE then merges only within these chunks, so it can't create tokens that mix different character types.
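As a concrete illustration, here's a deliberately simplified pattern (production tokenizers such as GPT-2's use a more elaborate, Unicode-aware pattern via the third-party regex module; this one is just to show the idea):

import re

# Match a word, a number, a run of punctuation, or whitespace,
# each optionally preceded by a single space (simplified for illustration).
PATTERN = r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+"

chunks = re.findall(PATTERN, "The dog ran. The dog barked!")
print(chunks)
# ['The', ' dog', ' ran', '.', ' The', ' dog', ' barked', '!']
# BPE now runs inside each chunk, so "dog" is always seen the same way
# and "." can never fuse onto the word before it.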
5. Encoding & Decoding
After training, we have a set of merge rules, but we still need a way to use them. We need to encode new text into tokens (so the model can process it) and decode tokens back into text (so we can read the model's output).
Decoding: Tokens → Text
To decode tokens back into text, we need a vocabulary table that maps each token ID to its byte sequence. For the base vocabulary (0-255), each token ID equals its byte value, so token 97 maps to [97]. For merged tokens, we combine the bytes of the two tokens that were merged: if token 256 represents the merge of a (97) and t (116), it maps to [97, 116]:
When you create a new merged token, immediately store its bytes by concatenating the bytes of the first token and the bytes of the second token: vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]. After training, every token’s bytes are ready for instant lookup.
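In code, that bookkeeping is only a couple of lines (shown here with the dict[int, list[int]] shape used by the decode signature below):

# Base vocabulary: each of the 256 byte values maps to itself
vocab = {i: [i] for i in range(256)}

# When training creates a merged token, record its bytes right away.
# Example from above: token 256 is the merge of 97 ("a") and 116 ("t").
vocab[256] = vocab[97] + vocab[116]
print(vocab[256])   # [97, 116]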
With this vocabulary, decoding becomes a dictionary lookup:
def decode(ids: list[int], vocab: dict[int, list[int]]) -> str:
    """Convert token IDs back to text."""
    # Look up each ID in vocab, concatenate the bytes, decode as UTF-8
    pass

Try implementing this in the Challenge tab.
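Attempt it first; afterwards, one possible sketch is:

def decode(ids: list[int], vocab: dict[int, list[int]]) -> str:
    byte_values = []
    for token_id in ids:
        byte_values.extend(vocab[token_id])   # each token's bytes, in sequence order
    # Not every byte sequence is valid UTF-8, so replace anything malformed
    return bytes(byte_values).decode("utf-8", errors="replace")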
Encoding: Text → Tokens
To encode new text, we start with raw bytes and apply each merge rule in the order it was learned, since later merge rules may depend on tokens created by earlier ones.
def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Tokenize text using learned merge rules."""
    ids = list(text.encode("utf-8"))  # Start with raw bytes
    # Apply each merge rule in the order it was learned
    # Hint: iterate through merges.items() and use your merge function
    pass

Since Python dictionaries preserve insertion order, we can iterate through merges and apply each rule in the order it was learned. To see why order matters, suppose we learned (116, 104) → 256 first and (256, 101) → 257 second, but applied the second rule before the first: it would look for the pair (256, 101) in the token sequence, but token 256 wouldn't exist yet, so the rule would find no match and get skipped.
For example, encoding "the" with the merge rules (116, 104) → 256 (t, h → th) and (256, 101) → 257 (th, e → the):

In training order (correct):
- [116, 104, 101] → [256, 101] (t, h → th)
- [256, 101] → [257] (th, e → the)

Out of order (wrong):
- [116, 104, 101] → [116, 104, 101] (rule (256, 101) → 257 finds no token 256 yet, so it is skipped)
- [116, 104, 101] → [256, 101] (t, h → th, but the full "the" token is never formed)
Try implementing this in the Challenge tab.
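Once you've had a go, one possible sketch (reusing the merge helper from earlier) is:

def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    ids = list(text.encode("utf-8"))          # Start with raw bytes
    for pair, new_id in merges.items():       # insertion order = training order
        ids = merge(ids, pair, new_id)
    return ids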
6. Recap
Let's step back and see the complete flow from text to token IDs.
1. Text: the original input string.
2. Pre-split: the string is divided into chunks; punctuation stays separate.
3. Bytes: each chunk becomes a list of UTF-8 bytes.
4. Merges: the learned merge rules compress the bytes into tokens.
5. Token IDs: the final sequence of integers.
- Tokenization converts text into a sequence of integer IDs
- BPE learns merge rules by iteratively combining frequent byte pairs
- Regex pre-splitting keeps words and punctuation in separate chunks
- Decoding converts tokens back to text by looking up each token's bytes
- Encoding converts text to tokens by applying merge rules in the order they were learned
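To tie it together, here is a short end-to-end sketch using the functions from this chapter (regex pre-splitting omitted for brevity; the toy corpus is just for illustration):

# Train a tiny tokenizer, then round-trip a new string through it.
text = "the dog chased the other dog"
ids, merges = train_bpe(text, num_merges=5)

# Build the vocabulary table from the merge rules
vocab = {i: [i] for i in range(256)}
for (first, second), new_id in merges.items():
    vocab[new_id] = vocab[first] + vocab[second]

tokens = encode("the dog", merges)
print(tokens)                   # fewer than the 7 raw bytes of "the dog"
print(decode(tokens, vocab))    # "the dog"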
We now have a sequence of token IDs. But these are just identifiers. They don't tell us anything about what the tokens mean. In the next chapter, we'll see how Embeddings turn each token ID into a vector that captures meaning.