From Text to Bytes

How computers represent language (Unicode & UTF-8).

Large Language Models are trained on massive datasets of text. Before we can feed this text into a model, we need to understand how computers represent text and how we convert human-readable characters into the numerical format computers use.

When you type "Hello" into ChatGPT, the model doesn't see the word "Hello". It doesn't see letters at all. It sees a sequence of numbers. Every piece of text, in every language, gets converted to numbers before a language model can process it. This chapter covers how that conversion works.

1. Text Representation

Computers only understand binary numbers (0s and 1s). To store or process text, we need a system that assigns a number to each character. As long as everyone agrees on the mapping, computers can store and exchange text reliably.

This consistency is achieved through character encoding standards. Currently, Unicode is the universal standard. It acts as a large lookup table that assigns a unique number, called a Code Point, to every character in almost every language.

Today, Unicode defines over 150,000 characters across 161 scripts. We usually represent these code points with the notation U+XXXX (where XXXX is a hexadecimal number).

Examples of Unicode Code Point Mappings

Character | Description | Unicode Code Point | Decimal Value
A | Latin Capital Letter A | U+0041 | 65
a | Latin Small Letter a | U+0061 | 97
é | Latin 'e' with acute | U+00E9 | 233
ሀ | Amharic Letter Ha | U+1200 | 4,608
中 | Chinese "Middle" | U+4E2D | 20,013
😊 | Smiling Face | U+1F60A | 128,522
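
You can look up these code points yourself: in Python, ord() returns a character's code point and chr() converts a number back into its character:

# Inspect the code point of each character from the table above
for ch in ["A", "a", "é", "中", "😊"]:
    cp = ord(ch)
    print(f"{ch!r}: U+{cp:04X} (decimal {cp})")

print(chr(0x1F60A))   # converts the code point back to 😊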
Important Distinction

Unicode only defines which number represents each character. It says nothing about how to store that number as bytes in memory. That's a separate decision called encoding, and it matters a lot for efficiency.

2. Encoding

The simplest encoding would be to store every code point as a 4-byte (32-bit) integer, known as UTF-32. This handles any Unicode character easily, since even the largest code points fit in 4 bytes.

However, UTF-32 is space-inefficient for most text. For example, a simple text file of English letters (which only need 1 byte) would be four times larger than necessary. Let's consider storing "Hello" using UTF-32:

Character | Code Point | Stored in UTF-32 (Hex)
H | U+0048 | 00 00 00 48
e | U+0065 | 00 00 00 65
l | U+006C | 00 00 00 6C
l | U+006C | 00 00 00 6C
o | U+006F | 00 00 00 6F

For code points that don't require 4 bytes, the computer has to pad the remaining space with zeros. So, five characters require 20 bytes, but 15 of those bytes are zeros. That's 75% waste.

For English text (which dominates the internet), UTF-32 uses four times more storage than necessary. This makes UTF-32 impractical for almost all real-world applications.
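
To see the padding directly, you can encode a string with Python's utf-32-be codec (big-endian, without the byte-order mark) and count the zero bytes:

text = "Hello"
utf32 = text.encode("utf-32-be")   # big-endian UTF-32, no byte-order mark

print(len(utf32))        # 20 bytes for 5 characters
print(utf32.hex(" "))    # 00 00 00 48 00 00 00 65 ...
print(utf32.count(0))    # 15 of the 20 bytes are zeros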

To solve this problem, we use a smarter encoding.

3. UTF-8: Variable-Width Encoding

UTF-8 solves the storage problem with a clever idea: use fewer bytes for common characters and more bytes only when needed. Instead of wasting 4 bytes on every character, UTF-8 adapts based on the code point's size.

3.1 Byte Ranges by Character Type

In UTF-8, different characters need different amounts of storage. ASCII characters (the basic Latin alphabet, digits, and punctuation) are so common that UTF-8 keeps them as single bytes for maximum efficiency. As characters become less common in English text (accented letters, then Asian scripts, then emoji), they get allocated more bytes. Here's how UTF-8 divides the Unicode space:

  • 1 byte: ASCII characters (A-Z, a-z, 0-9), U+0000 to U+007F
  • 2 bytes: Latin extensions, Greek, Cyrillic (é, ñ, Ω, Д), U+0080 to U+07FF
  • 3 bytes: CJK, Amharic, most scripts (中, 日, ㄱ, ሀ), U+0800 to U+FFFF
  • 4 bytes: Emojis, rare scripts (😊, 🚀, 𓀀), U+10000 to U+10FFFF

With UTF-8, "Hello" takes just 5 bytes instead of 20. English text is as compact as it was with ASCII, while still supporting every character in Unicode.
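
You can confirm these byte counts by encoding one character from each range:

print(len("Hello".encode("utf-8")))   # 5 bytes, down from 20 in UTF-32

for ch in ["A", "é", "中", "😊"]:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")

# 'A': 1 byte(s) -> 41
# 'é': 2 byte(s) -> c3 a9
# '中': 3 byte(s) -> e4 b8 ad
# '😊': 4 byte(s) -> f0 9f 98 8a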

3.2 Fixed vs Variable Width: The Parsing Challenge

Variable-width encoding creates a problem that fixed-width encodings don't have: how do you know where one character ends and the next begins? With UTF-32, this is trivial: every character is exactly 4 bytes, so you just read bytes in fixed chunks of four. Let's see the difference visually.

UTF-32: Fixed-width parsing
Byte stream for "Aé中":
00000041|000000E9|00004E2D
Every 4 bytes = 1 character. Simple.

But with UTF-8, characters have variable lengths (1, 2, or 3 bytes in this example). Looking at the same string encoded in UTF-8, we see fewer bytes total, but no obvious boundaries:

UTF-8: Variable-width... but where are the boundaries?
Byte stream for "Aé中":
41C3A9E4B8AD
6 bytes total, but which bytes belong to which character?

This is the core challenge: when reading a stream of UTF-8 bytes, how does a computer know where one character ends and another begins? UTF-8 solves this with a clever bit pattern system.
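
To reproduce the comparison yourself, encode the same string both ways in Python and print the raw hex:

s = "Aé中"
print(s.encode("utf-32-be").hex())   # 00000041000000e900004e2d  (3 x 4 bytes)
print(s.encode("utf-8").hex())       # 41c3a9e4b8ad              (1 + 2 + 3 bytes)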

3.3 The UTF-8 Bit Patterns

UTF-8's brilliant solution is to embed the length information directly into each byte. The first few bits of every byte follow a strict pattern that tells you exactly what role that byte plays. Think of it like a prefix code: before the actual character data, each byte announces "I'm the start of a 2-byte character" or "I'm a continuation of the previous character."

Here's the complete pattern. Pay attention to how the leading bits change:

Byte Count | First Byte Pattern | Continuation Bytes | Bits for Data
1 byte | 0xxxxxxx | (none) | 7 bits
2 bytes | 110xxxxx | 10xxxxxx | 11 bits
3 bytes | 1110xxxx | 10xxxxxx × 2 | 16 bits
4 bytes | 11110xxx | 10xxxxxx × 3 | 21 bits

The rule is elegantly simple: count the leading 1s in the first byte to know how many bytes the character uses. A leading 0 means a single byte. 110 means two bytes. 1110 means three. 11110 means four. Continuation bytes always start with 10, so they can never be mistaken for the start of a character.

Key Insight

This pattern ensures you can never confuse bytes. If you see a byte starting with 10, you know it's a continuation. Keep reading backward or forward to find the leading byte. If you see 0, 110, 1110, or 11110, you're at the start of a character and know exactly how many bytes to read.
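
Here's that rule expressed as a tiny helper, just a sketch for classifying single bytes rather than a full decoder:

def utf8_role(byte: int) -> str:
    """Classify a single UTF-8 byte by its leading bits."""
    if byte >> 7 == 0b0:          # 0xxxxxxx
        return "1-byte character"
    if byte >> 6 == 0b10:         # 10xxxxxx
        return "continuation byte"
    if byte >> 5 == 0b110:        # 110xxxxx
        return "start of 2-byte character"
    if byte >> 4 == 0b1110:       # 1110xxxx
        return "start of 3-byte character"
    if byte >> 3 == 0b11110:      # 11110xxx
        return "start of 4-byte character"
    return "invalid in UTF-8"

for b in bytes.fromhex("41 C3 A9 E4 B8 AD"):
    print(f"{b:02X} = {b:08b} -> {utf8_role(b)}")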

3.4 Decoding a Byte Stream

Now let's apply these patterns to actually decode the UTF-8 byte stream we saw earlier. We'll go through each byte, check its leading bits to determine its role, and reconstruct the original characters. This is exactly what your computer does every time it reads a text file.

Bytes: 41 C3 A9 E4 B8 AD

Byte 1: 41 = 01000001 → starts with 0, so it's a 1-byte char = "A"

Byte 2: C3 = 11000011 → starts with 110, so read 2 bytes total

Byte 3: A9 = 10101001 → continuation byte → C3 A9 = "é"

Byte 4: E4 = 11100100 → starts with 1110, so read 3 bytes total

Byte 5: B8 = 10111000 → continuation byte

Byte 6: AD = 10101101 → continuation byte → E4 B8 AD = "中"

And there we have it! The boundaries are now clear: [41] [C3 A9] [E4 B8 AD] → "A", "é", "中". Notice how we never had to guess. Each byte's leading bits told us exactly what to do. This self-describing property is what makes UTF-8 so robust: even if data gets corrupted or you start reading from the middle of a file, you can always find the next valid character boundary.
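
Putting the rules together, here is a minimal hand-rolled decoder that follows exactly the steps above. It assumes well-formed input and skips the error handling a real decoder needs:

def decode_utf8(data: bytes) -> list[str]:
    """Decode UTF-8 bytes by hand, following the leading-bit rules."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b >> 7 == 0b0:              # 0xxxxxxx: 1-byte character
            code_point, extra = b, 0
        elif b >> 5 == 0b110:          # 110xxxxx: 2-byte character
            code_point, extra = b & 0b00011111, 1
        elif b >> 4 == 0b1110:         # 1110xxxx: 3-byte character
            code_point, extra = b & 0b00001111, 2
        else:                          # 11110xxx: 4-byte character
            code_point, extra = b & 0b00000111, 3
        for j in range(1, extra + 1):  # each continuation byte adds 6 data bits
            code_point = (code_point << 6) | (data[i + j] & 0b00111111)
        chars.append(chr(code_point))
        i += extra + 1
    return chars

print(decode_utf8(bytes.fromhex("41 C3 A9 E4 B8 AD")))   # ['A', 'é', '中']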

3.5 Encoding a Character to Bytes

We've seen how to decode UTF-8, but how does the encoding process work in reverse? When your computer needs to save a character to a file, it must convert the Unicode code point into the correct sequence of bytes. Let's walk through this process step by step.

We'll encode é (U+00E9, decimal 233). First, we check which byte range it falls into: since 233 is larger than 127 but smaller than 2048, it needs 2 bytes. Now we need to split its bits and insert them into the UTF-8 template.

Step 1: Convert code point to binary

233 in decimal = 11101001 in binary (8 bits)

Step 2: Split bits according to 2-byte template
Template: 110xxxxx 10xxxxxx (5 bits + 6 bits = 11 bits)
We need to fit: 11101001 (pad to 11 bits → 00011101001)
Split: 00011 and 101001
Step 3: Insert into UTF-8 template
Byte 1: 110 + 00011 = 11000011 = 195
Byte 2: 10 + 101001 = 10101001 = 169
Result

"é" → [195, 169]

4. Complete Example

Let's trace through how "Hi 👋" gets converted to bytes. This is the exact process that happens before text reaches a language model.

Step 1: Look up each character's code point

  • H → U+0048 (decimal 72) → 1-byte
  • i → U+0069 (decimal 105) → 1-byte
  • (space) → U+0020 (decimal 32) → 1-byte
  • 👋 → U+1F44B (decimal 128,075) → 4-byte

Step 2: Encode each code point to bytes

The first three characters have code points below 128, so they map directly to single bytes: 72, 105, 32.

The emoji's code point (128,075) requires 4-byte encoding. Using the same process we showed for "é", the bits get distributed across four bytes following the 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx template, producing: 240, 159, 145, 139.

Step 3: The final byte sequence

What the model sees:

72 (0x48)  105 (0x69)  32 (0x20)  240 (0xF0)  159 (0x9F)  145 (0x91)  139 (0x8B)

Four characters became seven bytes. This sequence of integers is what text processing systems, including language models, actually receive as input.
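
In Python, a single call performs this whole conversion, which makes it easy to verify the sequence above:

data = "Hi 👋".encode("utf-8")
print(list(data))      # [72, 105, 32, 240, 159, 145, 139]
print(data.hex(" "))   # 48 69 20 f0 9f 91 8b
print(len("Hi 👋"), "characters ->", len(data), "bytes")   # 4 characters -> 7 bytes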

5. Why Not Train on Raw Bytes?

We have just spent this entire chapter establishing that UTF-8 is the universal standard for text. It can represent every language, every emoji, and every symbol using just 256 basic units (bytes).

A very natural question arises: Why don't we just feed these raw bytes directly into the model?

It seems like the perfect solution. It's universal: you would never need to update your model for new languages, since Spanish, Chinese, and Python code are all just bytes. It's also tiny: the model would only need a vocabulary of 256 items, which is incredibly memory-efficient compared to storing hundreds of thousands of words.

Given how elegant this solution seems, could we just train a model on raw bytes? Let's think through the trade-offs:

Think About the Trade-off

Raw bytes keep the vocabulary tiny, but they make sequences very long: every character becomes one to four input positions, so even a short paragraph turns into hundreds of steps for the model to process. At the other extreme, using whole words keeps sequences short but requires an impossibly large vocabulary that can never cover every possible word. Neither extreme works well on its own; we need something in between.
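
To make the sequence-length side of this trade-off concrete, here is a rough comparison (whitespace splitting stands in for a word vocabulary here, and the exact counts depend on the sentence you pick):

sentence = "Language models read text as numbers. 自然言語処理 🚀"

byte_units = list(sentence.encode("utf-8"))   # byte-level: one unit per byte
word_units = sentence.split()                 # word-level: one unit per whitespace-separated word

print(len(sentence), "characters")
print(len(byte_units), "byte-level units")    # non-ASCII text inflates this quickly
print(len(word_units), "word-level units")    # short, but needs a huge vocabulary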

6. Next: Tokenization

The algorithm that finds this middle ground is called Byte Pair Encoding (BPE). It starts with individual bytes and iteratively merges the most frequent adjacent pairs. After training on a large text corpus, common patterns like "ing", "tion", and "the" become single units called tokens, while rare words get split into smaller recognizable pieces.

Preview: How GPT-4 tokenizes text
"Hello" →[Hello](1 token)
"Unbelievable" →[Un][believ][able](3 tokens)
Summary
  • Unicode assigns a unique code point to every character in every writing system
  • UTF-8 encodes those code points as variable-length byte sequences (1-4 bytes)
  • Raw bytes are universal but create sequences that are too long for efficient processing
  • Full words keep sequences short but require impossibly large vocabularies
  • We need something in between: tokens

In the next chapter, we'll build BPE from scratch and see exactly how models like GPT-4 convert text into the units they actually process.