From Text to Bytes

How computers represent language (Unicode & UTF-8).

When you send a prompt to an LLM, the model doesn't see words or letters. It sees a sequence of numbers. Every piece of text, in every language, gets converted to numbers before a language model can process it. This chapter covers how that conversion works.

1. Text Representation

Computers only understand binary numbers (0s and 1s). To store or process text, we need a system that assigns a number to each character. As long as everyone agrees on the mapping, computers can use it to store and exchange text reliably.

This consistency is achieved through Unicode, the universal standard for text. It acts as a large lookup table that assigns a unique number, called a Code Point, to every character in almost every language. Today, Unicode defines over 150,000 characters spanning more than 160 scripts. We usually represent these code points with the notation U+XXXX (where XXXX is a hexadecimal number).

Before Unicode, every region invented its own encoding: American systems used ASCII, Russian systems used KOI8, and Japanese systems used Shift-JIS. A file written in one encoding looked like garbage when opened with another. Unicode solved this by providing a single universal standard.

Examples of Unicode Code Point Mappings

Character | Description            | Unicode Code Point | Decimal Value
A         | Latin Capital Letter A | U+0041             | 65
a         | Latin Small Letter a   | U+0061             | 97
Γ©         | Latin 'e' with acute   | U+00E9             | 233
αˆ€         | Amharic Letter Ha      | U+1200             | 4,608
δΈ­        | Chinese "Middle"       | U+4E2D             | 20,013
😊        | Smiling Face           | U+1F60A            | 128,522
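
A quick way to inspect these mappings yourself: Python's built-in ord() returns a character's code point, and chr() goes the other way. A minimal check of the table above:

    # Code point lookups with Python built-ins: ord() and chr().
    for ch in ["A", "a", "Γ©", "αˆ€", "δΈ­", "😊"]:
        cp = ord(ch)
        print(f"{ch}  U+{cp:04X}  (decimal {cp})")

    print(chr(0x1F60A))  # prints the smiling-face emoji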
Important Distinction

Unicode only defines which number represents each character. It says nothing about how to store that number as bytes in memory. That's a separate decision called encoding, and it matters a lot for efficiency.

2. Encoding

The simplest encoding would be to store every code point as a 4-byte (32-bit) integer, known as UTF-32. This handles any Unicode character easily, since even the largest code points fit in 4 bytes.

However, UTF-32 is space-inefficient for most text. For example, a text file of English letters (each of which needs only 1 byte) would be four times larger than necessary. Let's consider storing "Hello" using UTF-32:

Character | Code Point | Stored in UTF-32 (Hex)
H         | U+0048     | 00 00 00 48
e         | U+0065     | 00 00 00 65
l         | U+006C     | 00 00 00 6C
l         | U+006C     | 00 00 00 6C
o         | U+006F     | 00 00 00 6F

For code points that don't require 4 bytes, the computer has to pad the remaining space with zeros. So, five characters require 20 bytes, but 15 of those bytes are zeros. That's 75% waste.

For English text (which dominates the internet), UTF-32 uses four times more storage than necessary. This makes UTF-32 impractical for almost all real-world applications, so we need a more efficient encoding.
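
You can observe the overhead directly in Python; here "utf-32-be" is the big-endian variant, which skips the byte-order mark that a plain "utf-32" encode would prepend:

    text = "Hello"

    # UTF-32 (big-endian, no byte-order mark): exactly 4 bytes per character.
    encoded = text.encode("utf-32-be")
    print(len(encoded))       # 20
    print(encoded.hex(" "))   # 00 00 00 48 00 00 00 65 00 00 00 6c 00 00 00 6c 00 00 00 6f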

3. UTF-8

UTF-8 solves the storage problem by using fewer bytes for common characters and more bytes only when needed, instead of wasting 4 bytes on every character.

3.1 How UTF-8 Allocates Bytes

In UTF-8, different characters need different amounts of storage. ASCII characters (the basic Latin alphabet, digits, and punctuation) are so common that UTF-8 keeps them as single bytes for maximum efficiency. As characters become less common in English text (accented letters, then Asian scripts, then emoji), they get allocated more bytes. Here's how UTF-8 divides the Unicode space:

Bytes   | Covers                            | Examples        | Code Point Range
1 byte  | ASCII characters                  | A-Z, a-z, 0-9   | U+0000 to U+007F
2 bytes | Latin extensions, Greek, Cyrillic | Γ©, Γ±, Ξ©, Π”      | U+0080 to U+07FF
3 bytes | CJK, Amharic, most scripts        | δΈ­, ζ—₯, γ„±, αˆ€   | U+0800 to U+FFFF
4 bytes | Emojis, rare scripts              | 😊, πŸš€, π“€€      | U+10000 to U+10FFFF
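
One character from each tier, checked in Python:

    # Encoded length of one character from each UTF-8 size tier.
    for ch in ["A", "Γ©", "δΈ­", "😊"]:
        encoded = ch.encode("utf-8")
        print(f"{ch}  U+{ord(ch):04X}  {len(encoded)} byte(s): {encoded.hex(' ')}")

    # A   U+0041   1 byte(s): 41
    # Γ©   U+00E9   2 byte(s): c3 a9
    # δΈ­  U+4E2D   3 byte(s): e4 b8 ad
    # 😊  U+1F60A  4 byte(s): f0 9f 98 8a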

With UTF-8, "Hello" takes just 5 bytes instead of 20. English text is as compact as it was with ASCII, while still supporting every character in Unicode. Try different text in the interactive comparison below to see how the byte counts change:

[Interactive demo: compare the per-character UTF-32 and UTF-8 byte counts for any text you type]

3.2 Finding Character Boundaries

Variable-width encoding creates a problem that fixed-width encodings don't have. How do you know where one character ends and the next begins? With UTF-32, every character is exactly 4 bytes, so you just read in fixed chunks of four. But with UTF-8, characters vary in length (1 to 4 bytes).

UTF-32: Fixed-width parsing
Byte stream for "AΓ©δΈ­":
00000041|000000E9|00004E2D
Every 4 bytes = 1 character.

With UTF-8, the same string uses fewer bytes, but the boundaries aren't obvious:

UTF-8: Variable-width... but where are the boundaries?
Byte stream for "AΓ©δΈ­":
41C3A9E4B8AD
6 bytes total, but which bytes belong to which character?

This is the core challenge with variable-width encoding.

3.3 How UTF-8 Marks Boundaries

UTF-8 solves the boundary problem by embedding length information directly into each byte's leading bits. The number of leading 1s tells you how many bytes the character uses: a leading 0 means one byte, 110 means two, 1110 means three, and 11110 means four. Continuation bytes always start with 10, so they can never be mistaken for the start of a new character.

Here's the complete pattern:

Byte Count | First Byte Pattern | Continuation Bytes | Bits for Data
1 byte     | 0xxxxxxx           | -                  | 7 bits
2 bytes    | 110xxxxx           | 10xxxxxx           | 11 bits
3 bytes    | 1110xxxx           | 10xxxxxx × 2       | 16 bits
4 bytes    | 11110xxx           | 10xxxxxx × 3       | 21 bits
Key Insight

Any byte in a UTF-8 stream is self-identifying. A 10 prefix means continuation byte. A 0, 110, 1110, or 11110 prefix means start of a character and tells you exactly how many bytes to read. You can jump to any position in a file and find the nearest character boundary.
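
Here is a minimal sketch of that rule as code. The helper name utf8_byte_role is just illustrative; it classifies a single byte by its leading bits:

    def utf8_byte_role(b: int) -> str:
        """Classify one byte of a UTF-8 stream by its leading bits."""
        if b < 0b10000000:       # 0xxxxxxx
            return "start of a 1-byte character"
        if b < 0b11000000:       # 10xxxxxx
            return "continuation byte"
        if b < 0b11100000:       # 110xxxxx
            return "start of a 2-byte character"
        if b < 0b11110000:       # 1110xxxx
            return "start of a 3-byte character"
        return "start of a 4-byte character"  # 11110xxx (0xF5-0xFF never appear in valid UTF-8)

    for b in bytes.fromhex("41 C3 A9 E4 B8 AD"):
        print(f"{b:02X} = {b:08b} -> {utf8_byte_role(b)}")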

3.4 Decoding

Decoding converts bytes back into characters. For the UTF-8 byte stream from earlier, we read each byte's leading bits to determine whether it starts a new character or continues the previous one.

Bytes: 41 C3 A9 E4 B8 AD

41 = 01000001 ← starts with 0 → single byte → "A"

C3 = 11000011 ← two 1s → read 2 bytes
A9 = 10101001 ← continuation
→ "Γ©"

E4 = 11100100 ← three 1s → read 3 bytes
B8 = 10111000 ← continuation
AD = 10101101 ← continuation
→ "δΈ­"

The boundaries are now clear: [41] [C3 A9] [E4 B8 AD] → "A", "Γ©", "δΈ­". Each byte's leading bits told us exactly what to do, with no guessing required.
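
Python's built-in decoder applies exactly these rules, so you can check the result directly:

    data = bytes.fromhex("41 C3 A9 E4 B8 AD")
    text = data.decode("utf-8")
    print(text)                                            # AΓ©δΈ­
    print(len(data), "bytes ->", len(text), "characters")  # 6 bytes -> 3 characters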

3.5 Encoding

Encoding works in the opposite direction. Given a Unicode code point, we determine how many bytes it needs, then split its bits into the appropriate UTF-8 template.

Let's encode Γ© (U+00E9, decimal 233). First, we determine how many bytes it needs by checking its value: 233 falls between 128 and 2047, so it requires the 2-byte template.

Step 1: Convert code point to binary

233 in decimal = 11101001 in binary (8 bits)

Step 2: Count the available data slots

Each byte reserves some bits for the prefix, leaving the rest as "slots" for our actual data:

Byte 1: 110xxxxx → 5 data slots
Byte 2: 10xxxxxx → 6 data slots
Total: 5 + 6 = 11 data slots to fill

Step 3: Pad the binary to fill all slots

Our binary 11101001 has only 8 bits, so we pad it with leading zeros to fill all 11 slots:

11101001 (8 bits) → 00011101001 (11 bits)

Step 4: Split and add prefix bits

We split from the right to fill each byte's slots, then prepend the prefix bits:

Split from right: 00011 (5 bits) | 101001 (6 bits)
Byte 1: 110 + 00011 = 11000011 = 195 (0xC3)
Byte 2: 10 + 101001 = 10101001 = 169 (0xA9)

Result

"Γ©" → [195 (0xC3), 169 (0xA9)]

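The same arithmetic as a short sketch, checked against Python's built-in encoder. The helper encode_2byte is illustrative and only handles the 2-byte range (U+0080 to U+07FF):

    def encode_2byte(code_point: int) -> bytes:
        """UTF-8 encode a code point in the 2-byte range (U+0080 to U+07FF)."""
        assert 0x80 <= code_point <= 0x7FF
        byte1 = 0b11000000 | (code_point >> 6)         # 110xxxxx: top 5 data bits
        byte2 = 0b10000000 | (code_point & 0b111111)   # 10xxxxxx: low 6 data bits
        return bytes([byte1, byte2])

    print(list(encode_2byte(0x00E9)))   # [195, 169]
    print(list("Γ©".encode("utf-8")))    # [195, 169]
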
4. Complete Example

Let's trace through how "Hi πŸ‘‹" gets converted to bytes.

Step 1: Look up each character's code point

  • H: U+0048 (decimal 72), 1 byte
  • i: U+0069 (decimal 105), 1 byte
  • (space): U+0020 (decimal 32), 1 byte
  • πŸ‘‹: U+1F44B (decimal 128075), 4 bytes

Step 2: Encode to bytes

The first three characters have code points below 128, so they map directly to single bytes: 72, 105, 32. The emoji (code point 128075) requires 4-byte encoding. Following the same process from section 3.5, we use the 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx template, producing: 240, 159, 145, 139.

Step 3: The final byte sequence

72 (0x48), 105 (0x69), 32 (0x20), 240 (0xF0), 159 (0x9F), 145 (0x91), 139 (0x8B)

Our four characters became seven bytes. You can verify this with list("Hi πŸ‘‹".encode('utf-8')) in Python.
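
Running that check:

    data = "Hi πŸ‘‹".encode("utf-8")
    print(list(data))                                         # [72, 105, 32, 240, 159, 145, 139]
    print(len("Hi πŸ‘‹"), "characters ->", len(data), "bytes")  # 4 characters -> 7 bytes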

5. Interactive: Character to Bytes

[Interactive demo: type any text to see each character mapped to its UTF-8 bytes]

6. Why Not Train on Raw Bytes?

We've now seen how UTF-8 converts any text into a sequence of bytes. Every language, every emoji, every symbol reduces to just 256 possible byte values. So why not feed these raw bytes directly into an AI model?

The Problem with Bytes

The issue isn't what the model can read, but how much attention it has to spend reading it. Imagine someone telling you a story by pronouncing every letter:

"T... H... E... Q... U... I... C... K... B... R... O... W... N... F... O... X..."

By the time they reach "F-O-X," you've forgotten the beginning. Your brain is so busy assembling letters that it has no capacity left for meaning.

AI models face a similar challenge. They process text sequentially, one unit at a time, and can only hold so many units in memory. This limit is called the Context Window. When each byte is a separate unit, sequences become very long. For example, "Artificial Intelligence" is just 2 words to us, but 23 separate bytes for the model. Most of its capacity gets spent on low-level details instead of meaning.
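
The numbers are easy to verify:

    text = "Artificial Intelligence"
    print(len(text.split()), "words")          # 2 words
    print(len(text.encode("utf-8")), "bytes")  # 23 bytes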

What About Words?

The opposite extreme is to give every word its own ID: "Apple" = 101, "Banana" = 102. This keeps sequences short, but creates new problems. First, the vocabulary becomes huge. To cover all languages, names, and technical terms, we'd need millions of IDs, consuming massive amounts of memory during training. Second, the vocabulary is rigid. When a new term like "GPT-8" is invented, the model has no ID for it and fails to process it.

The Solution: Tokens

So bytes give us sequences that are too long, and words give us vocabularies that are too large. The middle ground is to start with bytes and merge adjacent ones based on frequency. For instance, the bytes for 't', 'h', and 'e' appear together so often that it's efficient to merge them into a single unit, called a token. The same goes for common words like "apple" and "run", and for suffixes like "ing". Rare words like "cryptocurrency" get split into familiar pieces such as "crypto" + "currency".

This gives us the best of both worlds. Common text stays compact and efficient, while any new or rare word can be built from smaller, known pieces.

7. Next: Tokenization

We now have the right concept: common words should be single units, and rare words should be split. This process of chunking text into pieces is called Tokenization, and the resulting pieces are called tokens.

Preview: Tokenization in action
"Hello" → [Hello] (1 token)
"Unbelievable" → [Un][believ][able] (3 tokens)
Summary
  • Unicode assigns a unique number (code point) to every character
  • UTF-8 encodes those numbers efficiently into bytes
  • Raw bytes are too granular (wasting model attention)
  • Words are too sparse (huge vocabularies)
  • Tokens are the middle ground: frequent patterns merged into single units

In the next chapter, we'll build the algorithm that finds these optimal tokens: Byte Pair Encoding (BPE).