In the previous chapter, we learned that computers see text as a stream of raw bytes. We also realized that feeding these bytes directly into the model is inefficient. Sequences become too long, and the model wastes capacity recognizing basic patterns like "the" over and over.
We need a way to group bytes into meaningful chunks. This process is called Tokenization, and the industry-standard algorithm, used by GPT-4, Llama, and Claude, is Byte Pair Encoding (BPE).
1. The Core Intuition: Tokenization is Compression
While the name "Byte Pair Encoding" sounds technical, the intuition is surprisingly simple: BPE is a compression algorithm.
Think about how you text. You don't type "Laughing Out Loud" every time. That's 17 characters. Instead, you type "LOL" (3 characters). You've mentally agreed that this sequence appears so frequently it deserves to be a single unit.
BPE does exactly this, but automatically. It reads through massive amounts of text (like Wikipedia) and asks: "Which sequences of characters appear together most often?"
By the end of this process, common words like "apple" become single tokens, while rare words remain as smaller chunks. This allows the model to process text much more efficiently.
2. The Algorithm: A Visual Walkthrough
To truly understand BPE, we need to run the algorithm by hand. We'll use a simple sequence to see the mechanics clearly before scaling up to real text.
The Dataset
Imagine we are training a tokenizer on this string:

aaabdaaabac
Iteration 0: The Starting Point
Our vocabulary consists only of the individual characters a, b, c, and d. The sequence has 11 tokens: a a a b d a a a b a c.
Step 1: Count the Pairs
The algorithm scans every adjacent pair: each token paired with the one right next to it. For our sequence, the counts are: (a, a) appears 4 times, (a, b) appears twice, and (b, d), (d, a), (b, a), and (a, c) each appear once.
Step 2: The First Merge
The winner is (a, a) with 4 occurrences. The algorithm creates a new token to represent this pair. Let's call it Z. We add Z to our vocabulary and replace every aa in our data: a a a b d a a a b a c becomes Z a b d Z a b a c.
Length: 9 tokens (down from 11)
We just compressed the data. The information is identical, but the sequence is shorter. This is the core of BPE.
Step 3: Repeat
Now we count pairs again on the new sequence. Both (Z, a) and (a, b) appear twice. The algorithm picks one (let's choose ab). We create token Y, and Z a b d Z a b a c becomes Z Y d Z Y a c.
Length: 7 tokens (down from 9)
The Result
In just two merges, our toy sequence shrank from 11 tokens to 7. Scaled to the entire internet, BPE typically achieves 50-60% compression compared to raw bytes.
3. Implementing BPE in Python
Now let's build the actual code. This is the exact logic used inside GPT-2 and GPT-4.
In the walkthrough above, we used letters like a, b, c to make the algorithm easy to follow. But remember what we learned in Chapter 1: text is already a sequence of integers. When you encode "hello" as UTF-8, you get [104, 101, 108, 108, 111]. These byte values range from 0 to 255, giving us our base vocabulary of 256 tokens. When BPE merges a frequent pair, it needs to assign a new ID to that merged token. Since 0-255 are already taken by the raw bytes, new tokens start at 256, then 257, 258, and so on.
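A quick check in a Python REPL makes this concrete:

text = "hello"
ids = list(text.encode("utf-8"))
print(ids)  # [104, 101, 108, 108, 111], five integers between 0 and 255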
Breaking Down the Algorithm
Looking back at our BPE walkthrough, the algorithm repeats three operations: count pairs, pick the most frequent one, and merge it. To implement this, we need three functions that work together:
- get_stats: counts how often each adjacent pair appears
- merge: replaces all occurrences of a pair with a new token
- train_bpe: the main loop that calls the other two until we reach the target vocab size
Let's implement each one.
Function 1: Count Pair Statistics
The first function scans a list of integers and returns a dictionary counting how often each adjacent pair occurs. For example, [1, 2, 3, 1, 2] should return {(1, 2): 2, (2, 3): 1, (3, 1): 1}.
def get_stats(ids: list[int]) -> dict[tuple, int]:
    """
    Given a list of integers, return a dictionary of counts of consecutive pairs.
    Example: [1, 2, 3, 1, 2] -> {(1, 2): 2, (2, 3): 1, (3, 1): 1}
    """
    counts = {}
    # Loop through the list and look at each element and its neighbor
    # Your implementation here...
    return counts

Try implementing this yourself! Switch to the Challenge tab to test your solution.
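If you want to compare your attempt against a reference afterwards, here is one possible implementation, shown as a sketch (not necessarily the Challenge tab's exact solution):

def get_stats(ids: list[int]) -> dict[tuple, int]:
    """Count how often each consecutive pair appears in ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):  # yields (ids[i], ids[i+1]) for every position
        counts[pair] = counts.get(pair, 0) + 1
    return counts

print(get_stats([1, 2, 3, 1, 2]))  # {(1, 2): 2, (2, 3): 1, (3, 1): 1}

zip(ids, ids[1:]) walks the list and its one-step-offset copy in lockstep, which is a compact way to visit every adjacent pair.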
Function 2: Merge a Pair
The second function takes the token list and replaces every occurrence of a target pair with a new token ID.
def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """
    In the list of integers (ids), replace all consecutive occurrences
    of pair with the new token new_id.
    """
    newids = []
    i = 0
    while i < len(ids):
        # Check if we found the pair (and aren't at the end)
        # If match: append new_id and skip ahead by 2 (we consumed both elements)
        # If no match: append current element and move ahead by 1
        pass
    return newids

This one requires careful index management. Switch to the Challenge tab to try it!
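Again, one possible reference implementation, sketched for comparison:

def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every consecutive occurrence of pair in ids with new_id."""
    newids = []
    i = 0
    while i < len(ids):
        # If the pair starts at position i (and we aren't at the last element), merge it
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(new_id)
            i += 2  # we consumed both elements
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([97, 97, 97, 98], (97, 97), 256))  # [256, 97, 98]

Note how the overlapping run of three a's produces only one merge: once the first two are consumed, the third has no partner left.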
Function 3: The Training Loop
Finally, we combine everything into a training loop. We start with UTF-8 bytes (vocab size 256) and keep merging until we reach a target vocabulary size.
GPT-4, for example, stops at roughly 100,000 tokens. This is a hyperparameter, a number we choose before training to balance efficiency (more tokens = shorter sequences) vs. memory (larger embedding table).
def train_bpe(text: str, num_merges: int) -> tuple[list[int], dict]:
    ids = list(text.encode("utf-8"))  # Start with raw bytes
    merges = {}  # Will store our learned merge rules
    for i in range(num_merges):
        # Find the most frequent pair, merge it, record the rule
        # New token IDs start at 256 (after the 0-255 byte range)
        pass
    return ids, merges

Put it all together! Switch to the Challenge tab to implement the complete training function.
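One way the loop body could look, sketched here reusing get_stats and merge from above:

def train_bpe(text: str, num_merges: int) -> tuple[list[int], dict]:
    ids = list(text.encode("utf-8"))  # Start with raw bytes
    merges = {}                       # Will store our learned merge rules
    for i in range(num_merges):
        stats = get_stats(ids)
        if not stats:                      # nothing left to merge
            break
        pair = max(stats, key=stats.get)   # the most frequent pair
        new_id = 256 + i                   # new token IDs start after the 0-255 byte range
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

ids, merges = train_bpe("aaabdaaabac", num_merges=2)
print(len(ids))  # 7, the same length the hand walkthrough reached
print(merges)    # {(97, 97): 256, (256, 97): 257}

max(stats, key=stats.get) breaks ties by whichever pair was counted first, so its second merge differs from the ab we chose by hand, but the compressed length comes out the same.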
After training, merges contains all the rules needed to tokenize new text, and ids contains your compressed training data.
4. Regex Pre-splitting
If you ran the code above on English text, you'd have a working tokenizer, but it would have a flaw.
Consider the string: "dog dog. dog!"
To a human, the core concept is "dog". But our simple algorithm treats spaces and punctuation as characters to merge. It might see dog. as completely different from dog (with a space). It might even merge g and . into a weird token g..
The model has to learn "dog" three separate times!
The Fix: Regex Splitting
Modern tokenizers (like GPT-4's) use Regular Expressions to pre-split text into meaningful chunks before applying BPE. They force boundaries between words and punctuation.
BPE runs on each chunk independently. "dog" is always tokenized as "dog".
This semantic separation helps the model learn language patterns much faster because it doesn't waste capacity learning that "dog." and "dog," are the same word.
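As a rough sketch of what this looks like in practice, here is the pre-tokenization pattern from GPT-2's tokenizer applied to our example (the variable name is just for illustration; the third-party regex package is needed for the \p{L} and \p{N} character classes):

import regex as re  # the third-party "regex" package, not the standard-library re

# GPT-2's pre-tokenization pattern: contractions, words, numbers, punctuation, whitespace
GPT2_SPLIT_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

chunks = re.findall(GPT2_SPLIT_PATTERN, "dog dog. dog!")
print(chunks)  # ['dog', ' dog', '.', ' dog', '!']

Notice that the leading space stays attached to the word (' dog'), so spaces don't become tokens of their own, while the punctuation is cleanly split off. BPE merges can never cross a chunk boundary.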
5. Encoding & Decoding
We've trained a tokenizer and learned merge rules. Now we need two more functions: one to decode tokens back into text, and one to encode new text into tokens. Let's start with the easier one.
The Vocabulary Table
First, we need a vocabulary table that maps each token ID to its byte sequence. Remember from Chapter 1: text is just a sequence of integers (bytes). The letter "a" is byte 97, "b" is 98, and so on. Our vocabulary stores these byte sequences directly: ID 97 maps to [97], ID 98 maps to [98], and a merged token such as 256 maps to [97, 97].
The key insight: each token ID points directly to its byte sequence, no calculation needed at runtime. When we see token 256, we just grab [97, 97] from the dictionary. This is why decoding is so fast.
When you create a new merged token, immediately store its bytes by concatenating the bytes of the first token and the bytes of the second token: vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]. After training, every token’s bytes are ready for instant lookup.
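In code, building this table from the merge rules could look like the following sketch (build_vocab is an illustrative helper name, not something defined earlier in the chapter):

def build_vocab(merges: dict[tuple[int, int], int]) -> dict[int, list[int]]:
    """Map every token ID to the list of raw bytes it stands for."""
    vocab = {i: [i] for i in range(256)}     # base tokens: each byte maps to itself
    for (a, b), new_id in merges.items():    # insertion order = training order
        vocab[new_id] = vocab[a] + vocab[b]  # concatenate the two byte lists
    return vocab

vocab = build_vocab({(97, 97): 256, (256, 98): 257})
print(vocab[257])  # [97, 97, 98]

Because the merges dictionary preserves training order, vocab[a] and vocab[b] are guaranteed to exist by the time each rule is processed.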
Decoding: Tokens → Text
With a flattened vocabulary, decoding is just a dictionary lookup:
def decode(ids: list[int], vocab: dict[int, list[int]]) -> str:
    """Convert token IDs back to text."""
    # Look up each ID in vocab, concatenate the bytes, decode as UTF-8
    pass

An easy win! Switch to the Challenge tab to implement it.
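One possible implementation, sketched with errors="replace" so a malformed byte sequence degrades gracefully instead of raising:

def decode(ids: list[int], vocab: dict[int, list[int]]) -> str:
    """Convert token IDs back to text."""
    byte_values = []
    for token_id in ids:
        byte_values.extend(vocab[token_id])  # each ID maps to a list of byte values
    # errors="replace" guards against IDs that split a multi-byte UTF-8 character
    return bytes(byte_values).decode("utf-8", errors="replace")

vocab = {i: [i] for i in range(256)}
vocab[256] = [97, 97]
print(decode([256, 98], vocab))  # "aab"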
Visual Example: Decoding
Encoding: Text → Tokens
Encoding involves a few more steps. To tokenize new text, we apply our learned merge rules in the exact order we learned them.
def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Tokenize text using learned merge rules."""
    ids = list(text.encode("utf-8"))  # Start with raw bytes
    # Apply each merge rule in the order it was learned
    # Hint: iterate through merges.items() and use your merge function
    pass

Since Python dictionaries preserve insertion order, we can simply iterate through merges and apply each rule. But why does order matter? Some merge rules depend on tokens created by earlier merges. If we learned (97,97)→256 first and (256,98)→257 second, the second rule can only match after the first rule has created token 256. Applying them out of order means the second rule looks for a token that doesn't exist yet, so it gets skipped. Let's see this with an example.
Why Order Matters: Encoding "aab" with Two Merges
We have two merge rules: merge #1: (97,97)→256 and merge #2: (256,98)→257. The order we apply them changes the result.
Correct order (merge #1 first, then merge #2):
- Apply (97,97)→256: [97, 97, 98] → [256, 98]
- Apply (256,98)→257: [256, 98] → [257]

Wrong order (merge #2 first, then merge #1):
- Apply (256,98)→257: [97, 97, 98] → [97, 97, 98] (no 256 yet, skipped!)
- Apply (97,97)→256: [97, 97, 98] → [256, 98]

Applied in the correct order, we end with the single token [257]; applied out of order, we get stuck at [256, 98].
Visual Example: Encoding
The final boss! Implement encoding to tokenize new text. Switch to the Challenge tab.
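Having seen why order matters, here is one possible encode, sketched on the assumption that the merge function from earlier is available:

def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Tokenize text using learned merge rules."""
    ids = list(text.encode("utf-8"))     # Start with raw bytes
    for pair, new_id in merges.items():  # dicts preserve insertion (training) order
        ids = merge(ids, pair, new_id)   # reuse the merge function from earlier
    return ids

print(encode("aab", {(97, 97): 256, (256, 98): 257}))  # [257]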
6. The Complete Pipeline
Let's step back and see the entire journey of text before it reaches the neural network.
- Raw text: the original input string.
- Regex pre-splitting: the text is broken into meaningful chunks.
- UTF-8 encoding: each chunk is converted to raw bytes.
- BPE merges: learned merge rules compress the bytes into tokens.
- Token IDs: this is what the neural network actually sees.
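To see the whole pipeline as code, here is a rough end-to-end sketch that strings together the functions from this chapter (train_bpe, build_vocab, encode, and decode as sketched above); regex pre-splitting is left out to keep it short:

text = "aaabdaaabac"
_, merges = train_bpe(text, num_merges=2)  # learn merge rules from the training text
vocab = build_vocab(merges)                # flatten the rules into an ID -> bytes table

tokens = encode("aabac", merges)           # tokenize new text with the learned rules
print(tokens)                              # a short list of integer IDs
print(decode(tokens, vocab))               # "aabac": the round trip is lossless

If the round trip gives back the original text, the tokenizer is wired up correctly. To recap: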
- Tokenization compresses text into efficient chunks called tokens
- BPE learns merge rules by iteratively combining frequent pairs
- Regex pre-splitting ensures words stay consistent regardless of punctuation
- Decoding is a simple vocabulary lookup since we store each token's bytes
- Encoding applies learned merges in order to tokenize new text
- The final output is a sequence of integer IDs: the atoms of language models
These integers are the only thing the neural network ever sees. They are the atoms of the Large Language Model universe. In the next chapter, we'll see how the model turns these integers into Vectors (Embeddings) to begin understanding their meaning.