Tokenization

How BPE compresses text into the atoms of language models.

In the previous chapter, we learned that computers see text as a stream of raw bytes. We also realized that feeding these bytes directly into the model is inefficient. Sequences become too long, and the model wastes capacity recognizing basic patterns like "the" over and over.

We need a way to group bytes into meaningful chunks. This process is called Tokenization, and the industry-standard algorithm, used by GPT-4, Llama, and Claude, is Byte Pair Encoding (BPE).

1. The Core Intuition: Tokenization is Compression

While the name "Byte Pair Encoding" sounds technical, the intuition is surprisingly simple: BPE is a compression algorithm.

Think about how you text. You don't type "Laughing Out Loud" every time. That's 17 characters. Instead, you type "LOL" (3 characters). You've mentally agreed that this sequence appears so frequently it deserves to be a single unit.

"Laughing Out Loud"
17 characters
"LOL"
3 characters

BPE does exactly this, but automatically. It reads through massive amounts of text (like Wikipedia) and asks: "Which sequences of characters appear together most often?"

Common Patterns Learned
  • t + h → "th" (appears constantly together)
  • th + e → "the" (one of the most common words)
  • in + g → "ing" (common suffix)

By the end of this process, common words like "apple" become single tokens, while rare words remain as smaller chunks. This allows the model to process text much more efficiently.

2. The Algorithm: A Visual Walkthrough

To truly understand BPE, we need to run the algorithm by hand. We'll use a simple sequence to see the mechanics clearly before scaling up to real text.

The Dataset

Imagine we are training a tokenizer on this string:

aaabdaaabac

Iteration 0: The Starting Point

Our vocabulary consists only of individual characters. The sequence has 11 tokens.

a  a  a  b  d  a  a  a  b  a  c

Step 1: Count the Pairs

The algorithm scans every adjacent pair, each token paired with the one right next to it. Here's how to see the pairs:

Adjacent pairs in our sequence:

aa  aa  ab  bd  da  aa  aa  ab  ba  ac

Counting each pair type:

(a, a) × 4
(a, b) × 2
(b, d) × 1
(d, a) × 1
(b, a) × 1
(a, c) × 1

Step 2: The First Merge

The winner is (a, a) with 4 occurrences. The algorithm creates a new token to represent this pair. Let's call it Z. We add Z to our vocabulary and replace every aa in our data.

After Merge #1: (a, a) → Z

Z  a  b  d  Z  a  b  a  c

Length: 9 tokens (down from 11)

Key Insight

We just compressed the data. The information is identical, but the sequence is shorter. This is the core of BPE.

Step 3: Repeat

Now we count pairs again on the new sequence. Both (Z, a) and (a, b) appear twice, so it's a tie; the algorithm simply picks one (let's choose (a, b)). We create token Y.

After Merge #2: (a, b) → Y

Z  Y  d  Z  Y  a  c

Length: 7 tokens (down from 9)

The Result

11 tokens at the start → 7 tokens after just two merges

Scaled to the entire internet, BPE typically achieves 50-60% compression compared to raw bytes.

3. Implementing BPE in Python

Now let's build the actual code. This is the exact logic used inside GPT-2 and GPT-4.

In the walkthrough above, we used letters like a, b, c to make the algorithm easy to follow. But remember what we learned in Chapter 1: text is already a sequence of integers. When you encode "hello" as UTF-8, you get [104, 101, 108, 108, 111]. These byte values range from 0 to 255, giving us our base vocabulary of 256 tokens. When BPE merges a frequent pair, it needs to assign a new ID to that merged token. Since 0-255 are already taken by the raw bytes, new tokens start at 256, then 257, 258, and so on.
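
To make this concrete, here is a tiny snippet you can run yourself (the merge mentioned in the comment is purely hypothetical):

ids = list("hello".encode("utf-8"))
print(ids)  # [104, 101, 108, 108, 111]  (every value is a byte between 0 and 255)

# If BPE later decided to merge, say, the pair (108, 108), that new token would
# receive the next free ID, 256; the merge after that would receive 257, and so on.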

Breaking Down the Algorithm

Looking back at our BPE walkthrough, the algorithm repeats three operations: count pairs, pick the most frequent one, and merge it. To implement this, we need three functions that work together:

1. get_stats: counts how often each adjacent pair appears
2. merge: replaces all occurrences of a pair with a new token
3. train_bpe: the main loop that calls the above until we reach the target vocab size

Let's implement each one.

Function 1: Count Pair Statistics

The first function scans a list of integers and returns a dictionary counting how often each adjacent pair occurs. For example, [1, 2, 3, 1, 2] should return {(1, 2): 2, (2, 3): 1, (3, 1): 1}.

def get_stats(ids: list[int]) -> dict[tuple, int]:
    """
    Given a list of integers, return a dictionary of counts of consecutive pairs.
    Example: [1, 2, 3, 1, 2] -> {(1, 2): 2, (2, 3): 1, (3, 1): 1}
    """
    counts = {}
    # Loop through the list and look at each element and its neighbor
    # Your implementation here...
    return counts
Challenge: The Pair Counter

Try implementing this yourself! Switch to the Challenge tab to test your solution.
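
If you want to check your work afterwards, here is one possible implementation, a minimal sketch that matches the docstring above (a collections.Counter over zip(ids, ids[1:]) works just as well):

def get_stats(ids: list[int]) -> dict[tuple[int, int], int]:
    """Count how often each consecutive pair appears in ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):  # (ids[i], ids[i+1]) for every adjacent position
        counts[pair] = counts.get(pair, 0) + 1
    return counts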

Function 2: Merge a Pair

The second function takes the token list and replaces every occurrence of a target pair with a new token ID.

def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """
    In the list of integers (ids), replace all consecutive occurrences 
    of pair with the new token new_id.
    """
    newids = []
    i = 0
    while i < len(ids):
        # Check if we found the pair (and aren't at the end)
        # If match: append new_id and skip ahead by 2 (we consumed both elements)
        # If no match: append current element and move ahead by 1
        pass
    return newids
Challenge: The Token Merger

This one requires careful index management. Switch to the Challenge tab to try it!
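
For reference, here is one way to handle the index bookkeeping, a sketch rather than the only correct answer:

def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every consecutive occurrence of pair in ids with new_id."""
    newids = []
    i = 0
    while i < len(ids):
        # Match only if we are not at the last element and both positions line up
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(new_id)
            i += 2  # we consumed two elements
        else:
            newids.append(ids[i])
            i += 1
    return newids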

Function 3: The Training Loop

Finally, we combine everything into a training loop. We start with UTF-8 bytes (vocab size 256) and keep merging until we reach a target vocabulary size.

GPT-4, for example, stops at roughly 100,000 tokens. This is a hyperparameter: a number we choose before training to balance efficiency (a larger vocabulary means shorter sequences) against memory (a larger embedding table).

def train_bpe(text: str, num_merges: int) -> tuple[list[int], dict]:
    ids = list(text.encode("utf-8"))  # Start with raw bytes
    merges = {}  # Will store our learned merge rules
    
    for i in range(num_merges):
        # Find the most frequent pair, merge it, record the rule
        # New token IDs start at 256 (after the 0-255 byte range)
        pass
    
    return ids, merges
Challenge: The BPE Trainer

Put it all together! Switch to the Challenge tab to implement the complete training function.

After training, merges contains all the rules needed to tokenize new text, and ids contains your compressed training data.
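
Here is one way the loop can look, a minimal sketch that assumes the get_stats and merge functions from above:

def train_bpe(text: str, num_merges: int) -> tuple[list[int], dict]:
    ids = list(text.encode("utf-8"))  # start with raw bytes (IDs 0-255)
    merges = {}                       # (pair) -> new token ID

    for i in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break                         # nothing left to merge
        pair = max(stats, key=stats.get)  # most frequent adjacent pair
        new_id = 256 + i                  # new IDs start after the byte range
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id

    return ids, merges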

4. Regex Pre-splitting

If you ran the code above on English text, you'd have a working tokenizer, but it would have a flaw.

Consider the string: "dog dog. dog!"

To a human, the core concept is "dog". But our simple algorithm treats spaces and punctuation as characters to merge. It might see "dog." as completely different from " dog" (with a leading space). It might even merge g and . into a weird token "g.".

The Problem
"dog" + space → one pattern
"dog" + period → different pattern
"dog" + exclamation → yet another pattern

The model has to learn "dog" three separate times!

The Fix: Regex Splitting

Modern tokenizers (like GPT-4's) use Regular Expressions to pre-split text into meaningful chunks before applying BPE. They force boundaries between words and punctuation.

The Solution
Input:"dog. dog!"
After regex:
"dog""."" dog""!"

BPE runs on each chunk independently. "dog" is always tokenized as "dog".

This semantic separation helps the model learn language patterns much faster because it doesn't waste capacity learning that "dog." and "dog," are the same word.
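
To make this concrete, here is a sketch of a pre-split step. The pattern closely follows the one GPT-2 ships with; it needs the third-party regex package, because Python's built-in re module does not support the \p{L} Unicode classes:

import regex as re  # pip install regex

# Keep contractions, runs of letters, runs of digits, runs of punctuation,
# and whitespace as separate chunks.
GPT2_SPLIT_PATTERN = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(re.findall(GPT2_SPLIT_PATTERN, "dog. dog!"))
# ['dog', '.', ' dog', '!']  -- BPE then runs on each chunk independently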

5. Encoding & Decoding

We've trained a tokenizer and learned merge rules. Now we need two more functions: one to decode tokens back into text, and one to encode new text into tokens. Let's start with the easier one.

The Vocabulary Table

First, we need a vocabulary table that maps each token ID to its byte sequence. Remember from Chapter 1: text is just a sequence of integers (bytes). The letter "a" is byte 97, "b" is 98, and so on. Our vocabulary stores these byte sequences directly:

The Vocabulary: Token ID → Bytes
# Base vocabulary: single bytes (0-255)
vocab[97] = [97] # "a"
vocab[98] = [98] # "b"
vocab[99] = [99] # "c"
...
# Merged tokens: pre-computed byte sequences
vocab[256] = [97, 97] # "aa"
vocab[257] = [97, 98] # "ab"
vocab[258] = [97, 97, 98] # "aab"
...

The key insight: each token ID points directly to its byte sequence, no calculation needed at runtime. When we see token 256, we just grab [97, 97] from the dictionary. This is why decoding is so fast.

Building the Vocabulary During Training

When you create a new merged token, immediately store its bytes by concatenating the bytes of the first token and the bytes of the second token: vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]. After training, every token’s bytes are ready for instant lookup.
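
In code, that is just a base dictionary plus one line per merge (a sketch that assumes the merges dict returned by train_bpe):

# Base vocabulary: one entry per raw byte value
vocab = {i: [i] for i in range(256)}

# Extend it with every learned merge, in training order
for (left, right), new_id in merges.items():
    vocab[new_id] = vocab[left] + vocab[right]  # concatenate the two byte sequences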

Decoding: Tokens → Text

With a flattened vocabulary, decoding is just a dictionary lookup:

def decode(ids: list[int], vocab: dict[int, list[int]]) -> str:
    """Convert token IDs back to text."""
    # Look up each ID in vocab, concatenate the bytes, decode as UTF-8
    pass
Challenge: The Decoder

An easy win! Switch to the Challenge tab to implement it.
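
Here is one possible implementation, a minimal sketch (errors="replace" guards against token sequences that happen to split a multi-byte UTF-8 character):

def decode(ids: list[int], vocab: dict[int, list[int]]) -> str:
    """Convert token IDs back to text."""
    byte_values = []
    for token_id in ids:
        byte_values.extend(vocab[token_id])  # look up the stored bytes
    return bytes(byte_values).decode("utf-8", errors="replace")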

Visual Example: Decoding

Decoding [258, 99] back to text:

1. Token IDs: 258, 99
2. Look up bytes: vocab[258] → [97, 97, 98], vocab[99] → [99]
3. Concatenate: [97, 97, 98, 99]
4. Decode UTF-8: "aabc"

Encoding: Text → Tokens

Encoding involves a few more steps. To tokenize new text, we apply our learned merge rules in the exact order we learned them.

def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Tokenize text using learned merge rules."""
    ids = list(text.encode("utf-8"))  # Start with raw bytes
    
    # Apply each merge rule in the order it was learned
    # Hint: iterate through merges.items() and use your merge function
    pass

Since Python dictionaries preserve insertion order, we can simply iterate through merges and apply each rule. But why does order matter? Some merge rules depend on tokens created by earlier merges. If we learned (97,97)→256 first and (256,98)→257 second, the second rule can only match after the first rule has created token 256. Applying them out of order means the second rule looks for a token that doesn't exist yet, so it gets skipped. Let's see this with an example.

Why Order Matters: Encoding "aab" with Two Merges

We have two merge rules: merge #1: (97,97)→256 and merge #2: (256,98)→257. The order we apply them changes the result.

Correct Order: #1 then #2

Step 1 · Apply merge #1, (97,97)→256: [97, 97, 98] → [256, 98]
Step 2 · Apply merge #2, (256,98)→257: [256, 98] → [257]

Result: [257] (1 token)

Wrong Order: #2 then #1

Step 1 · Apply merge #2, (256,98)→257: [97, 97, 98] → [97, 97, 98] (no token 256 exists yet, so nothing matches!)
Step 2 · Apply merge #1, (97,97)→256: [97, 97, 98] → [256, 98]

Result: [256, 98] (2 tokens, missed merge!)

Visual Example: Encoding

Encoding "aaab" with merges: (97,97)→256, (256,97)→257
Start:[97, 97, 97, 98]← raw bytes: a, a, a, b
Merge (a,a):[256, 97, 98]← (97,97) becomes 256
Merge (256,a):[257, 98]← (256,97) becomes 257
Result:[257, 98]← 4 bytes compressed to 2 tokens
Challenge: The Encoder

The final boss! Implement encoding to tokenize new text. Switch to the Challenge tab.
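
One possible implementation, a minimal sketch that reuses the merge function from earlier:

def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Tokenize text by replaying the learned merge rules in training order."""
    ids = list(text.encode("utf-8"))     # start with raw bytes
    for pair, new_id in merges.items():  # dict preserves insertion (training) order
        ids = merge(ids, pair, new_id)
    return ids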

6. The Complete Pipeline

Let's step back and see the entire journey of text before it reaches the neural network.

1. Raw Text: "Hello World"
   The original input string.

2. Regex Split: ["Hello", " World"]
   Pre-split into meaningful chunks.

3. UTF-8 Encoding: [72, 101, 108, 108, 111], [32, 87, 111, 114, 108, 100]
   Convert text to raw bytes.

4. BPE Merging
   Apply learned merge rules to compress bytes into tokens.

5. Token IDs: [15496, 995]
   This is what the neural network actually sees.
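
Putting the pieces together, a minimal end-to-end sketch might look like the function below. It assumes the GPT2_SPLIT_PATTERN regex from the sketch in section 4 plus the encode function and merges dict from section 5:

def tokenize(text: str) -> list[int]:
    """Regex pre-split, then BPE-encode each chunk independently."""
    token_ids = []
    for chunk in re.findall(GPT2_SPLIT_PATTERN, text):  # step 2: regex split
        token_ids.extend(encode(chunk, merges))         # steps 3-5: bytes -> token IDs
    return token_ids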

Summary
  • Tokenization compresses text into efficient chunks called tokens
  • BPE learns merge rules by iteratively combining frequent pairs
  • Regex pre-splitting ensures words stay consistent regardless of punctuation
  • Decoding is a simple vocabulary lookup since we store each token's bytes
  • Encoding applies learned merges in order to tokenize new text
  • The final output is a sequence of integer IDs: the atoms of language models

These integers are the only thing the neural network ever sees. They are the atoms of the Large Language Model universe. In the next chapter, we'll see how the model turns these integers into Vectors (Embeddings) to begin understanding their meaning.