Tokenization

How BPE compresses text into efficient tokens.

In the previous chapter, we established that raw bytes create sequences that are too long, and word-level tokenization creates vocabularies that are too large. The solution is subword tokenization: chunking text into efficient pieces that balance sequence length with vocabulary size.

The industry-standard algorithm for this is Byte Pair Encoding (BPE). In this chapter, we'll understand how it works and build it from scratch.

1. The Key Insight

BPE is a compression algorithm. It finds patterns that appear frequently in the training data and replaces them with shorter symbols.

Think about how you text. You don't type "Laughing Out Loud" every time. That's 17 characters. Instead, you type "LOL" (3 characters). You've mentally agreed that this sequence appears so frequently it deserves to be a single unit.

"Laughing Out Loud"
17 characters
"LOL"
3 characters

BPE does exactly this, but automatically. It reads through massive amounts of text (like Wikipedia) and asks: "Which sequences of characters appear together most often?"

Common Patterns Learned
  • t + h → th (appears constantly together)
  • th + e → the (one of the most common words)
  • in + g → ing (common suffix)

By the end of this process, common words like "apple" become single tokens, while rare words are split into multiple smaller tokens. This allows the model to process text more efficiently.

2. BPE Walkthrough

To understand BPE, we need to run the algorithm by hand. We'll use a short sequence to see the mechanics clearly before scaling up to real text.

The Dataset

We'll train BPE on this 11-character string:

aaabdaaabac

Starting Point

Before any merges, each character is its own token. Our sequence is 11 tokens long.

a  a  a  b  d  a  a  a  b  a  c

Step 1: Count the Pairs

BPE scans every adjacent pair of tokens and counts how often each appears:

Adjacent pairs in our sequence (sliding one position at a time):

aa  aa  ab  bd  da  aa  aa  ab  ba  ac

Pair counts:
  • (a, a) × 4
  • (a, b) × 2
  • (b, d) × 1
  • (d, a) × 1
  • (b, a) × 1
  • (a, c) × 1
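
If you want to sanity-check these counts in code, a few lines of Python with collections.Counter will do it (this is only a verification of the walkthrough, not the implementation you'll build later):

from collections import Counter

seq = list("aaabdaaabac")
# Pair each token with its right neighbor and count how often each pair occurs
pair_counts = Counter(zip(seq, seq[1:]))

print(pair_counts[("a", "a")])     # 4
print(pair_counts.most_common(1))  # [(('a', 'a'), 4)]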

Merge

The pair (a, a) appears most often (4 times), so we merge it into a new token Z. Every aa in the original sequence becomes Z:

After Merge #1: (a, a) → Z
Z  a  b  d  Z  a  b  a  c

Length: 9 tokens (down from 11)

Key Insight

This is compression. The information is preserved, but the sequence is shorter. This is the core of BPE.

Repeat

We count pairs again on the updated sequence. Both (Z, a) and (a, b) now appear twice. When pairs tie in frequency, the algorithm picks one arbitrarily. Here we choose (a, b) and merge it into a new token Y:

After Merge #2: (a, b) → Y
Z  Y  d  Z  Y  a  c

Length: 7 tokens (down from 9)

Result

11 tokens (start)  →  7 tokens (after 2 merges)

On large text corpora, BPE typically achieves 50-60% compression compared to raw bytes.

3. Implementing BPE

Now we translate the walkthrough into code.

In the walkthrough, we used letters like a, b, c for clarity. But recall from Chapter 1: text is already a sequence of integers. When you encode "hello" as UTF-8, you get [104, 101, 108, 108, 111]. These byte values (0-255) form our base vocabulary of 256 tokens. Merged tokens get new IDs starting at 256, then 257, 258, and so on.

Breaking Down the Algorithm

In the walkthrough, we repeated three operations: count pairs, pick the most frequent, and merge it. We kept going until we decided to stop (after 2 merges). In practice, we continue until we reach a target vocabulary size. To implement the algorithm, we need three functions:

  1. get_stats: counts how often each adjacent pair appears.
  2. merge: replaces all occurrences of a pair with a new token.
  3. train_bpe: the main loop; count pairs, pick the most frequent, merge it, and repeat until the target vocab size is reached.

Here's each function with its signature. You'll implement them in the challenges.

Counting Pairs

get_stats scans a list of token IDs and counts how often each adjacent pair occurs. For example, [1, 2, 3, 1, 2] returns {(1, 2): 2, (2, 3): 1, (3, 1): 1}.

def get_stats(ids: list[int]) -> dict[tuple, int]:
    """
    Given a list of integers, return a dictionary of counts of consecutive pairs.
    Example: [1, 2, 3, 1, 2] -> {(1, 2): 2, (2, 3): 1, (3, 1): 1}
    """
    counts = {}
    # Loop through the list and look at each element and its neighbor
    # Your implementation here...
    return counts
Challenge: The Pair Counter

Try implementing this yourself! Switch to the Challenge tab to test your solution.
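
If you get stuck, or want something to compare your solution against afterwards, here is one possible implementation (a sketch; any loop that produces the same counts is fine):

def get_stats(ids: list[int]) -> dict[tuple, int]:
    """
    Given a list of integers, return a dictionary of counts of consecutive pairs.
    Example: [1, 2, 3, 1, 2] -> {(1, 2): 2, (2, 3): 1, (3, 1): 1}
    """
    counts = {}
    # Walk the list, pairing each element with its right neighbor
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts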

Merging a Pair

merge takes the token list and replaces every occurrence of a target pair with a new token ID.

def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """
    In the list of integers (ids), replace all consecutive occurrences 
    of pair with the new token new_id.
    """
    newids = []
    i = 0
    while i < len(ids):
        # Check if we found the pair (and aren't at the end)
        # If match: append new_id and skip ahead by 2 (we consumed both elements)
        # If no match: append current element and move ahead by 1
        pass
    return newids
Challenge: The Token Merger

This one requires careful index management. Switch to the Challenge tab to try it!
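
Again, one possible reference implementation (a sketch, not the only valid way to manage the indices):

def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """
    In the list of integers (ids), replace all consecutive occurrences
    of pair with the new token new_id.
    """
    newids = []
    i = 0
    while i < len(ids):
        # Match: we found the pair and aren't at the last element
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(new_id)
            i += 2  # Skip ahead by 2: we consumed both elements
        else:
            newids.append(ids[i])
            i += 1  # No match: keep the current element and move ahead by 1
    return newids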

The Training Loop

train_bpe combines everything into a training loop. We start with UTF-8 bytes (vocab size 256) and merge until we reach a target size. Large models use 50,000 to 100,000 tokens, balancing efficiency (shorter sequences) vs. memory (larger embedding table).

def train_bpe(text: str, num_merges: int) -> tuple[list[int], dict]:
    ids = list(text.encode("utf-8"))  # Start with raw bytes
    merges = {}  # Will store our learned merge rules
    
    for i in range(num_merges):
        # Find the most frequent pair, merge it, record the rule
        # New token IDs start at 256 (after the 0-255 byte range)
        pass
    
    return ids, merges
Challenge: The BPE Trainer

Put it all together! Switch to the Challenge tab to implement the complete training function.
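
For reference, here is one possible version of the training loop, built on the get_stats and merge sketches above:

def train_bpe(text: str, num_merges: int) -> tuple[list[int], dict]:
    ids = list(text.encode("utf-8"))  # Start with raw bytes
    merges = {}  # Will store our learned merge rules

    for i in range(num_merges):
        # Count pairs and pick the most frequent one
        stats = get_stats(ids)
        if not stats:
            break  # Nothing left to merge
        pair = max(stats, key=stats.get)
        new_id = 256 + i  # New token IDs start after the 0-255 byte range
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id  # Record the rule

    return ids, merges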

After training, merges contains all the rules needed to tokenize new text, and ids contains your compressed training data.

4. Regex Pre-splitting

BPE merges pairs across the entire text without understanding word boundaries. This means punctuation can merge with words: "dog." might become one token, while " dog" (with a leading space) becomes another. The same word gets tokenized differently based on what comes after it:

The Problem
"dog" + space → one pattern
"dog" + period → different pattern
"dog" + exclamation → yet another pattern

The word "dog" has no consistent representation. Its tokenization depends on surrounding characters.

The Fix

The fix is to split text into chunks before BPE runs, forcing boundaries between different character types (words, punctuation, numbers, whitespace). Modern tokenizers use Regular Expressions for this:

The Solution
Input: "dog. dog!"
After regex: "dog"  "."  " dog"  "!"

Now "dog" and "." are separate chunks, so they can't merge into a single token.

By forcing boundaries before BPE runs, words, punctuation, and numbers stay in separate chunks. BPE then merges only within these chunks, so it can't create tokens that mix different character types.
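
As a rough illustration, a deliberately simplified pattern with Python's built-in re module already produces the split shown above (production tokenizers such as GPT-2's use a far more elaborate pattern via the third-party regex library, handling contractions, Unicode letter classes, digits, and whitespace):

import re

# Simplified pre-split: a word (optionally with its leading space),
# a run of other non-whitespace characters, or a run of whitespace.
pattern = r" ?[A-Za-z]+|[^A-Za-z\s]+|\s+"

print(re.findall(pattern, "dog. dog!"))
# ['dog', '.', ' dog', '!']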

5. Encoding & Decoding

After training, we have a set of merge rules, but we still need a way to use them. We need to encode new text into tokens (so the model can process it) and decode tokens back into text (so we can read the model's output).

Decoding: Tokens → Text

To decode tokens back into text, we need a vocabulary table that maps each token ID to its byte sequence. For the base vocabulary (0-255), each token ID equals its byte value, so token 97 maps to [97]. For merged tokens, we combine the bytes of the two tokens that were merged: if token 256 represents the merge of a (97) and t (116), it maps to [97, 116]:

The Vocabulary: Token ID → Bytes
# Base vocabulary: single bytes (0-255)
vocab[97] = [97] # "a"
vocab[98] = [98] # "b"
vocab[99] = [99] # "c"
...
# Merged tokens: pre-computed byte sequences
vocab[256] = [97, 97] # "aa"
vocab[257] = [97, 98] # "ab"
vocab[258] = [97, 97, 98] # "aab"
...
Building the Vocabulary During Training

When you create a new merged token, immediately store its bytes by concatenating the bytes of the first token and the bytes of the second token: vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]. After training, every token’s bytes are ready for instant lookup.
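
As a sketch, assuming merges is the dict returned by train_bpe (pairs mapped to new token IDs, in the order they were learned), you can equivalently rebuild the vocabulary after training; the helper name build_vocab is ours, not part of the challenges:

def build_vocab(merges: dict[tuple[int, int], int]) -> dict[int, list[int]]:
    # Base vocabulary: each byte value maps to itself
    vocab = {i: [i] for i in range(256)}
    # Merged tokens: concatenate the bytes of the two tokens that were merged
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab

This works because merges preserves insertion order, so the two halves of every merge already exist in vocab by the time we need them.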

With this vocabulary, decoding becomes a dictionary lookup:

def decode(ids: list[int], vocab: dict[int, list[int]]) -> str:
    """Convert token IDs back to text."""
    # Look up each ID in vocab, concatenate the bytes, decode as UTF-8
    pass
Challenge: The Decoder

Try implementing this in the Challenge tab.

Decoding Example

Decoding [258, 99] back to text:
  1. Token IDs: [258, 99]
  2. Look up bytes: vocab[258] → [97, 97, 98], vocab[99] → [99]
  3. Concatenate: [97, 97, 98, 99]
  4. Decode as UTF-8: "aabc"

Encoding: Text → Tokens

To encode new text, we start with raw bytes and apply each merge rule in the order it was learned, since later merge rules may depend on tokens created by earlier ones.

def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Tokenize text using learned merge rules."""
    ids = list(text.encode("utf-8"))  # Start with raw bytes
    
    # Apply each merge rule in the order it was learned
    # Hint: iterate through merges.items() and use your merge function
    pass

Since Python dictionaries preserve insertion order, we can iterate through merges and apply each rule in the order it was learned. To see why order matters, consider: if we learned (116,104)→256 first and (256,101)→257 second, the second rule looks for the pair (256, 101) in the token sequence, but token 256 doesn't exist yet. The rule finds no match and gets skipped.

For example, encoding "the" with merge rules (116,104)→256 (t,h→th) and (256,101)→257 (th,e→the):

Correct Order: #1 then #2
  Step 1 · Apply merge #1, (116,104)→256: [116, 104, 101] → [256, 101] (th, e)
  Step 2 · Apply merge #2, (256,101)→257: [256, 101] → [257] (the)
  Result: [257] (1 token)

Wrong Order: #2 then #1
  Step 1 · Apply merge #2, (256,101)→257: [116, 104, 101] → [116, 104, 101] (no 256 exists yet, so the rule finds no match and is skipped)
  Step 2 · Apply merge #1, (116,104)→256: [116, 104, 101] → [256, 101] (th, e)
  Result: [256, 101] (2 tokens, missed a merge)

Encoding Example

Encoding "the" to tokens
Start:[116, 104, 101]t, h, e
Merge (116,104):[256, 101]th, e
Merge (256,101):[257]the
Result:[257]← 3 bytes compressed to 1 token
Challenge: The Encoder

Try implementing this in the Challenge tab.
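
For reference, one possible encode that matches the approach described above, reusing the merge sketch from earlier:

def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Tokenize text using learned merge rules."""
    ids = list(text.encode("utf-8"))  # Start with raw bytes
    # Apply each merge rule in the order it was learned
    # (Python dicts preserve insertion order, so this replays the training order)
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids

With the two rules from the example, (116,104)→256 followed by (256,101)→257, encode("the", merges) returns [257].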

6. Recap

Let's step back and see the complete flow from text to token IDs.

1. Raw Text: "Hello, World!"
   The original input string.

2. Regex Split: ["Hello", ",", " World", "!"]
   Pre-split into chunks. Punctuation stays separate.

3. UTF-8 Encoding: [72, 101, 108, 108, 111], [44], [32, 87, 111, 114, 108, 100], [33]
   Each chunk becomes a list of bytes.

4. BPE Merging
   Apply learned merge rules to compress bytes into tokens.

5. Token IDs: [15496, 11, 2159, 0]
   The final sequence of integers.
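
The IDs in step 5 appear to come from the GPT-2 vocabulary. If you have the third-party tiktoken library installed, you can reproduce the whole pipeline in a couple of lines (an optional aside, not required for the challenges):

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("Hello, World!")
print(ids)              # expected, per step 5 above: [15496, 11, 2159, 0]
print(enc.decode(ids))  # "Hello, World!"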

Summary
  • Tokenization converts text into a sequence of integer IDs
  • BPE learns merge rules by iteratively combining frequent byte pairs
  • Regex pre-splitting keeps words and punctuation in separate chunks
  • Decoding converts tokens back to text by looking up each token's bytes
  • Encoding converts text to tokens by applying merge rules in the order they were learned

We now have a sequence of token IDs. But these are just identifiers. They don't tell us anything about what the tokens mean. In the next chapter, we'll see how Embeddings turn each token ID into a vector that captures meaning.