When you send a prompt to an LLM, the model doesn't see words or letters. It sees a sequence of numbers. Every piece of text, in every language, gets converted to numbers before a language model can process it. This chapter covers how that conversion works.
1. Text Representation
Computers only understand binary numbers (0s and 1s). To store or process text, we need a system that assigns a number to each character. As long as everyone agrees on the mapping, computers can use it to store and exchange text reliably.
This consistency is achieved through Unicode, the universal standard for text. It acts as a large lookup table that assigns a unique number, called a Code Point, to every character in almost every language. Today, Unicode defines more than 150,000 characters across more than 160 scripts. We usually represent these code points with the notation U+XXXX (where XXXX is a hexadecimal number).
Before Unicode, every region invented its own encoding. Americans used ASCII, Russians used KOI8, Japanese used Shift-JIS. A file written in one encoding looked like garbage when opened with another. Unicode solved this by providing a single universal standard.
| Character | Description | Unicode Code Point | Decimal Value |
|---|---|---|---|
| A | Latin Capital Letter A | U+0041 | 65 |
| a | Latin Small Letter a | U+0061 | 97 |
| Γ© | Latin 'e' with acute | U+00E9 | 233 |
| α | Ethiopic Syllable Ha (used for Amharic) | U+1200 | 4608 |
| δΈ­ | Chinese "Middle" | U+4E2D | 20,013 |
| πŸ˜Š | Smiling Face | U+1F60A | 128,522 |
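You can check these values yourself with Python's built-in ord(), which returns a character's code point as an integer:

```python
# Print each character's code point in U+XXXX notation and in decimal.
for ch in ["A", "a", "Γ©", "α", "δΈ­", "πŸ˜Š"]:
    print(f"{ch}  U+{ord(ch):04X}  {ord(ch)}")
```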
Unicode only defines which number represents each character. It says nothing about how to store that number as bytes in memory. That's a separate decision called encoding, and it matters a lot for efficiency.
2. Encoding
The simplest encoding would be to store every code point as a 4-byte (32-bit) integer, known as UTF-32. This handles any Unicode character easily, since even the largest code points fit in 4 bytes.
However, UTF-32 is space-inefficient for most text. For example, a simple text file of English letters (which only need 1 byte) would be four times larger than necessary. Let's consider storing "Hello" using UTF-32:
| Character | Code Point | Stored in UTF-32 (Hex) |
|---|---|---|
| H | U+0048 | 00 00 00 48 |
| e | U+0065 | 00 00 00 65 |
| l | U+006C | 00 00 00 6C |
| l | U+006C | 00 00 00 6C |
| o | U+006F | 00 00 00 6F |
For code points that don't require 4 bytes, the computer has to pad the remaining space with zeros. So, five characters require 20 bytes, but 15 of those bytes are zeros. That's 75% waste.
For English text (which dominates the internet), UTF-32 uses four times more storage than necessary. This makes UTF-32 impractical for almost all real-world applications, so we need a more efficient encoding.
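To see the padding concretely, here's a quick check with Python's standard library (the 'utf-32-be' codec is the big-endian layout shown in the table, without a byte-order mark):

```python
# "Hello" in UTF-32: 20 bytes, 15 of which are zero padding.
utf32 = "Hello".encode("utf-32-be")
print(len(utf32))       # 20
print(utf32.hex(" "))   # 00 00 00 48 00 00 00 65 00 00 00 6c 00 00 00 6c 00 00 00 6f
print(utf32.count(0))   # 15
```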
3. UTF-8
UTF-8 solves the storage problem by using fewer bytes for common characters and more bytes only when needed, instead of wasting 4 bytes on every character.
3.1 How UTF-8 Allocates Bytes
In UTF-8, different characters need different amounts of storage. ASCII characters (the basic Latin alphabet, digits, and punctuation) are so common that UTF-8 keeps them as single bytes for maximum efficiency. As characters become less common in English text (accented letters, then Asian scripts, then emoji), they get allocated more bytes. Here's how UTF-8 divides the Unicode space:
| Code Point Range | Bytes in UTF-8 | Typical Characters |
|---|---|---|
| U+0000 to U+007F | 1 | ASCII: Latin letters, digits, punctuation |
| U+0080 to U+07FF | 2 | Accented Latin, Greek, Cyrillic, Hebrew, Arabic |
| U+0800 to U+FFFF | 3 | Most CJK characters, Ethiopic, Devanagari |
| U+10000 to U+10FFFF | 4 | Emoji, historic scripts, rare symbols |
With UTF-8, "Hello" takes just 5 bytes instead of 20. English text is as compact as it was with ASCII, while still supporting every character in Unicode. The comparison below shows how the byte counts change for different kinds of text.
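Here's a small Python sketch of that comparison, again using the standard library codecs:

```python
# Compare how many bytes the same text needs in UTF-32 versus UTF-8.
samples = ["Hello", "rΓ©sumΓ©", "δΈ­ζ–‡", "πŸ˜ŠπŸ‘‹"]
for text in samples:
    utf32 = text.encode("utf-32-be")   # 4 bytes per character
    utf8 = text.encode("utf-8")        # 1 to 4 bytes per character
    print(f"{text!r}: {len(text)} chars | UTF-32: {len(utf32)} bytes | UTF-8: {len(utf8)} bytes")
```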
3.2 Finding Character Boundaries
Variable-width encoding creates a problem that fixed-width encodings don't have. How do you know where one character ends and the next begins? With UTF-32, every character is exactly 4 bytes, so you just read in fixed chunks of four. But with UTF-8, characters vary in length (1 to 4 bytes).
Take the string "AΓ©δΈ­" as an example (we'll reuse it in the next sections).

UTF-32: Fixed-width parsing
00 00 00 41 | 00 00 00 E9 | 00 00 4E 2D β†’ "A", "Γ©", "δΈ­"

With UTF-8, the same string uses only six bytes, but the boundaries aren't obvious:

UTF-8: Variable-width... but where are the boundaries?
41 C3 A9 E4 B8 AD β†’ ?

This is the core challenge with variable-width encoding: the decoder needs a way to tell, from the bytes alone, where each character starts and ends.
3.3 How UTF-8 Marks Boundaries
UTF-8 solves the boundary problem by embedding length information directly into each byte's leading bits. The number of leading 1s tells you how many bytes the character uses: a leading 0 means one byte, 110 means two, 1110 means three, and 11110 means four. Continuation bytes always start with 10, so they can never be mistaken for the start of a new character.
Here's the complete pattern:
| Byte Count | First Byte Pattern | Continuation Bytes | Bits for Data |
|---|---|---|---|
| 1 byte | 0xxxxxxx | - | 7 bits |
| 2 bytes | 110xxxxx | 10xxxxxx | 11 bits |
| 3 bytes | 1110xxxx | 10xxxxxx Γ 2 | 16 bits |
| 4 bytes | 11110xxx | 10xxxxxx Γ 3 | 21 bits |
Any byte in a UTF-8 stream is self-identifying. A 10 prefix means continuation byte. A 0, 110, 1110, or 11110 prefix means start of a character and tells you exactly how many bytes to read. You can jump to any position in a file and find the nearest character boundary.
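Here's a small Python sketch of that property: starting from any byte offset, we skip backwards over continuation bytes (the 10xxxxxx pattern) until we reach a lead byte.

```python
def find_char_start(data: bytes, pos: int) -> int:
    """Back up from an arbitrary byte offset to the start of the character containing it."""
    while pos > 0 and (data[pos] & 0b11000000) == 0b10000000:  # 10xxxxxx = continuation byte
        pos -= 1
    return pos

data = "AΓ©δΈ­".encode("utf-8")   # bytes: 41 C3 A9 E4 B8 AD
for pos in range(len(data)):
    start = find_char_start(data, pos)
    print(f"byte {pos} (0x{data[pos]:02X}) belongs to the character starting at byte {start}")
```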
3.4 Decoding
Decoding converts bytes back into characters. For the UTF-8 byte stream from earlier, we read each byte's leading bits to determine whether it starts a new character or continues the previous one.
41 = 01000001 β starts with 0 β single byte
C3 = 11000011 β two 1s β read 2 bytes
A9 = 10101001 β continuation
E4 = 11100100 β three 1s β read 3 bytes
B8 = 10111000 β continuation
AD = 10101101 β continuation
The boundaries are now clear: [41] [C3 A9] [E4 B8 AD] β†’ "A", "Γ©", "δΈ­". Each byte's leading bits told us exactly what to do, with no guessing required.
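The same rules, written out as a minimal Python decoder (a sketch that skips the validation a real decoder performs, such as rejecting overlong or truncated sequences):

```python
def decode_utf8(data: bytes) -> str:
    """Minimal UTF-8 decoder: read each lead byte's prefix, then the continuation bytes it announces."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b >> 7 == 0b0:          # 0xxxxxxx -> 1 byte, 7 data bits
            code, extra = b, 0
        elif b >> 5 == 0b110:      # 110xxxxx -> 2 bytes, 5 data bits in the lead byte
            code, extra = b & 0b00011111, 1
        elif b >> 4 == 0b1110:     # 1110xxxx -> 3 bytes, 4 data bits in the lead byte
            code, extra = b & 0b00001111, 2
        elif b >> 3 == 0b11110:    # 11110xxx -> 4 bytes, 3 data bits in the lead byte
            code, extra = b & 0b00000111, 3
        else:
            raise ValueError(f"byte {i} is a stray continuation byte")
        for offset in range(1, extra + 1):
            code = (code << 6) | (data[i + offset] & 0b00111111)  # take 6 data bits per continuation byte
        chars.append(chr(code))
        i += 1 + extra
    return "".join(chars)

print(decode_utf8(bytes([0x41, 0xC3, 0xA9, 0xE4, 0xB8, 0xAD])))  # AΓ©δΈ­
```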
3.5 Encoding
Encoding works in the opposite direction. Given a Unicode code point, we determine how many bytes it needs, then split its bits into the appropriate UTF-8 template.
Let's encode Γ© (U+00E9, decimal 233). First, we determine how many bytes it needs by checking its value: 233 falls between 128 and 2047, so it requires the 2-byte template.
Step 1: Convert code point to binary
233 in decimal = 11101001 in binary (8 bits)
Step 2: Count the available data slots
Each byte reserves some bits for the prefix, leaving the rest as "slots" for our actual data. The 2-byte template 110xxxxx 10xxxxxx has 5 + 6 = 11 slots.
Step 3: Pad the binary to fill all slots
Our binary 11101001 has only 8 bits, so we pad it with three leading zeros to fill all 11 slots: 000 11101001.
Step 4: Split and add prefix bits
We split from the right to fill each byte's slots (00011 and 101001), then prepend the prefix bits: 110 + 00011 = 11000011 (0xC3) and 10 + 101001 = 10101001 (0xA9).
Result
"Γ©" β [195 (0xC3), 169 (0xA9)]
4. Complete Example
Let's trace through how "Hi πŸ‘‹" gets converted to bytes.
Step 1: Look up each character's code point
- H: U+0048 (72), 1 byte
- i: U+0069 (105), 1 byte
- (space): U+0020 (32), 1 byte
- πŸ‘‹: U+1F44B (128,075), 4 bytes
Step 2: Encode to bytes
The first three characters have code points below 128, so they map directly to single bytes: 72, 105, 32. The emoji (code point 128075) requires 4-byte encoding. Following the same process from section 3.5, we use the 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx template, producing: 240, 159, 145, 139.
Step 3: The final byte sequence
Our four characters became seven bytes: 72, 105, 32, 240, 159, 145, 139. You can verify this with list("Hi πŸ‘‹".encode('utf-8')) in Python.
5. Character to Bytes
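A few lines of Python do the same job: give the helper below any character and it prints the code point and the UTF-8 bytes.

```python
def char_to_bytes(ch: str) -> None:
    """Print a character's Unicode code point and its UTF-8 byte sequence."""
    cp = ord(ch)                  # the code point as an integer
    utf8 = ch.encode("utf-8")     # the UTF-8 byte sequence
    hex_bytes = " ".join(f"{b:02X}" for b in utf8)
    print(f"{ch!r}: U+{cp:04X} ({cp}) -> {len(utf8)} byte(s): {hex_bytes}")

for ch in ["A", "Γ©", "δΈ­", "πŸ‘‹"]:
    char_to_bytes(ch)
```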
6. Why Not Train on Raw Bytes?
We've now seen how UTF-8 converts any text into a sequence of bytes. Every language, every emoji, every symbol reduces to just 256 possible byte values. So why not feed these raw bytes directly into an AI model?
The Problem with Bytes
The issue isn't what the model can read, but how much attention it has to spend reading it. Imagine someone telling you a story by pronouncing every letter:
"T... H... E... Q... U... I... C... K... B... R... O... W... N... F... O... X..."
By the time they reach "F-O-X," you've forgotten the beginning. Your brain is so busy assembling letters that it has no capacity left for meaning.
AI models face a similar challenge. They process text sequentially, one unit at a time, and can only hold so many units in memory. This limit is called the Context Window. When each byte is a separate unit, sequences become very long. For example, "Artificial Intelligence" is just 2 words to us, but 23 separate bytes for the model. Most of its capacity gets spent on low-level details instead of meaning.
What About Words?
The opposite extreme is to give every word its own ID: "Apple" = 101, "Banana" = 102. This keeps sequences short, but creates new problems. First, the vocabulary becomes huge. To cover all languages, names, and technical terms, we'd need millions of IDs, consuming massive amounts of memory during training. Second, the vocabulary is rigid. When a new term like "GPT-8" is invented, the model has no ID for it and fails to process it.
The Solution: Tokens
So bytes give us sequences that are too long, and words give us vocabularies that are too large. The middle ground is to start with bytes and merge adjacent ones based on frequency. For instance, the bytes for 't', 'h', and 'e' appear together so often that it's efficient to merge them into a single unit, called a token. Same with common words like "apple", "run", or suffixes like "ing". Rare words like "cryptocurrency" stay as separate pieces like "crypto" + "currency".
This gives us the best of both worlds. Common text stays compact and efficient, while any new or rare word can be built from smaller, known pieces.
7. Next: Tokenization
We now have the right concept: common words should be single units, and rare words should be split. This process of chunking text into pieces is called Tokenization, and the resulting pieces are called tokens.
Preview: Tokenization in action
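Off-the-shelf tokenizers already implement this idea. The sketch below uses OpenAI's tiktoken library to split a sentence; the exact splits and token IDs depend on which tokenizer you load, so treat the output as illustrative.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the tokenizer used by several OpenAI models
text = "Artificial Intelligence is transforming cryptocurrency."
token_ids = enc.encode(text)

print(f"{len(text)} characters -> {len(token_ids)} tokens")
for token_id in token_ids:
    piece = enc.decode_single_token_bytes(token_id)   # the raw bytes behind each token
    print(token_id, piece.decode("utf-8", errors="replace"))
```

Stepping back, here's the path this chapter traced: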
- Unicode assigns a unique number (code point) to every character
- UTF-8 encodes those numbers efficiently into bytes
- Raw bytes are too granular (wasting model attention)
- Words are too sparse (huge vocabularies)
- Tokens are the middle ground: frequent patterns merged into single units
In the next chapter, we'll build the algorithm that finds these optimal tokens: Byte Pair Encoding (BPE).