From Text to Bytes

How computers represent language (Unicode & UTF-8).

When you send a prompt to an LLM, the model doesn't see words or letters. It sees a sequence of numbers. Every piece of text, in every language, gets converted to numbers before a language model can process it. This chapter covers how that conversion works.

1. Text Representation

Computers only understand binary numbers (0s and 1s). To store or process text, we need a system that assigns a number to each character. As long as everyone agrees on the mapping, computers can use it to store and exchange text reliably.

This consistency is achieved through Unicode, the universal standard for text. It acts as a large lookup table that assigns a unique number, called a Code Point, to every character in almost every language. Today, Unicode defines over 150,000 characters spanning more than 160 scripts. We usually represent these code points with the notation U+XXXX (where XXXX is a hexadecimal number).

Before Unicode, every region invented its own encoding: American systems used ASCII, Russian systems used KOI8, and Japanese systems used Shift-JIS. A file written in one encoding looked like garbage when opened with another. Unicode solved this by providing a single universal standard.

Examples of Unicode Code Point Mappings

Character | Description            | Unicode Code Point | Decimal Value
A         | Latin Capital Letter A | U+0041             | 65
a         | Latin Small Letter a   | U+0061             | 97
Γ©         | Latin 'e' with acute   | U+00E9             | 233
αˆ€         | Amharic Letter Ha      | U+1200             | 4,608
δΈ­        | Chinese "Middle"       | U+4E2D             | 20,013
😊        | Smiling Face           | U+1F60A            | 128,522
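
A quick way to inspect these mappings yourself: Python's built-in ord() returns a character's code point, and chr() goes the other way. A minimal check of the table above:

    # Code point lookups with Python built-ins: ord() and chr().
    for ch in ["A", "a", "Γ©", "αˆ€", "δΈ­", "😊"]:
        cp = ord(ch)
        print(f"{ch}  U+{cp:04X}  (decimal {cp})")

    print(chr(0x1F60A))  # prints the smiling-face emoji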
Important Distinction

Unicode only defines which number represents each character. It says nothing about how to store that number as bytes in memory. That's a separate decision called encoding, and it matters a lot for efficiency.

2. Encoding

The simplest encoding would be to store every code point as a 4-byte (32-bit) integer, known as UTF-32. This handles any Unicode character easily, since even the largest code points fit in 4 bytes.

However, UTF-32 is space-inefficient for most text. For example, a text file of English letters (each of which needs only 1 byte) would be four times larger than necessary. Let's consider storing "Hello" using UTF-32:

Character | Code Point | Stored in UTF-32 (Hex)
H         | U+0048     | 00 00 00 48
e         | U+0065     | 00 00 00 65
l         | U+006C     | 00 00 00 6C
l         | U+006C     | 00 00 00 6C
o         | U+006F     | 00 00 00 6F

For code points that don't require 4 bytes, the computer has to pad the remaining space with zeros. So, five characters require 20 bytes, but 15 of those bytes are zeros. That's 75% waste.

For English text (which dominates the internet), UTF-32 uses four times more storage than necessary. This makes UTF-32 impractical for almost all real-world applications, so we need a more efficient encoding.
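
You can observe the overhead directly in Python; here "utf-32-be" is the big-endian variant, which skips the byte-order mark that a plain "utf-32" encode would prepend:

    text = "Hello"

    # UTF-32 (big-endian, no byte-order mark): exactly 4 bytes per character.
    encoded = text.encode("utf-32-be")
    print(len(encoded))       # 20
    print(encoded.hex(" "))   # 00 00 00 48 00 00 00 65 00 00 00 6c 00 00 00 6c 00 00 00 6f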

3. UTF-8

UTF-8 solves the storage problem by using fewer bytes for common characters and more bytes only when needed, instead of wasting 4 bytes on every character.

3.1 How UTF-8 Allocates Bytes

In UTF-8, different characters need different amounts of storage. ASCII characters (the basic Latin alphabet, digits, and punctuation) are so common that UTF-8 keeps them as single bytes for maximum efficiency. As characters become less common in English text (accented letters, then Asian scripts, then emoji), they get allocated more bytes. Here's how UTF-8 divides the Unicode space:

Bytes   | Covers                            | Examples        | Code Point Range
1 byte  | ASCII characters                  | A-Z, a-z, 0-9   | U+0000 to U+007F
2 bytes | Latin extensions, Greek, Cyrillic | Γ©, Γ±, Ξ©, Π”      | U+0080 to U+07FF
3 bytes | CJK, Amharic, most scripts        | δΈ­, ζ—₯, γ„±, αˆ€   | U+0800 to U+FFFF
4 bytes | Emojis, rare scripts              | 😊, πŸš€, π“€€      | U+10000 to U+10FFFF
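
One character from each tier, checked in Python:

    # Encoded length of one character from each UTF-8 size tier.
    for ch in ["A", "Γ©", "δΈ­", "😊"]:
        encoded = ch.encode("utf-8")
        print(f"{ch}  U+{ord(ch):04X}  {len(encoded)} byte(s): {encoded.hex(' ')}")

    # A   U+0041   1 byte(s): 41
    # Γ©   U+00E9   2 byte(s): c3 a9
    # δΈ­  U+4E2D   3 byte(s): e4 b8 ad
    # 😊  U+1F60A  4 byte(s): f0 9f 98 8a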

With UTF-8, "Hello" takes just 5 bytes instead of 20. English text is as compact as it was with ASCII, while still supporting every character in Unicode. Try different text in the interactive comparison below to see how the byte counts change:

[Interactive demo: compare the per-character UTF-32 and UTF-8 byte counts for any text you type]

3.2 Finding Character Boundaries

Variable-width encoding creates a problem that fixed-width encodings don't have. How do you know where one character ends and the next begins? With UTF-32, every character is exactly 4 bytes, so you just read in fixed chunks of four. But with UTF-8, characters vary in length (1 to 4 bytes).

UTF-32: Fixed-width parsing
Byte stream for "AΓ©δΈ­":
00000041|000000E9|00004E2D
Every 4 bytes = 1 character.

With UTF-8, the same string uses fewer bytes, but the boundaries aren't obvious:

UTF-8: Variable-width... but where are the boundaries?
Byte stream for "AΓ©δΈ­":
41C3A9E4B8AD
6 bytes total, but which bytes belong to which character?

This is the core challenge with variable-width encoding.

3.3 How UTF-8 Marks Boundaries

UTF-8 solves the boundary problem by embedding length information directly into each byte's leading bits. The number of leading 1s tells you how many bytes the character uses: a leading 0 means one byte, 110 means two, 1110 means three, and 11110 means four. Continuation bytes always start with 10, so they can never be mistaken for the start of a new character.

Here's the complete pattern:

Byte Count | First Byte Pattern | Continuation Bytes | Bits for Data
1 byte     | 0xxxxxxx           | -                  | 7 bits
2 bytes    | 110xxxxx           | 10xxxxxx           | 11 bits
3 bytes    | 1110xxxx           | 10xxxxxx × 2       | 16 bits
4 bytes    | 11110xxx           | 10xxxxxx × 3       | 21 bits
Key Insight

Any byte in a UTF-8 stream is self-identifying. A 10 prefix means continuation byte. A 0, 110, 1110, or 11110 prefix means start of a character and tells you exactly how many bytes to read. You can jump to any position in a file and find the nearest character boundary.
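
Here is a minimal sketch of that rule as code. The helper name utf8_byte_role is just illustrative; it classifies a single byte by its leading bits:

    def utf8_byte_role(b: int) -> str:
        """Classify one byte of a UTF-8 stream by its leading bits."""
        if b < 0b10000000:       # 0xxxxxxx
            return "start of a 1-byte character"
        if b < 0b11000000:       # 10xxxxxx
            return "continuation byte"
        if b < 0b11100000:       # 110xxxxx
            return "start of a 2-byte character"
        if b < 0b11110000:       # 1110xxxx
            return "start of a 3-byte character"
        return "start of a 4-byte character"  # 11110xxx (0xF5-0xFF never appear in valid UTF-8)

    for b in bytes.fromhex("41 C3 A9 E4 B8 AD"):
        print(f"{b:02X} = {b:08b} -> {utf8_byte_role(b)}")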

3.4 Decoding

Decoding converts bytes back into characters. For the UTF-8 byte stream from earlier, we read each byte's leading bits to determine whether it starts a new character or continues the previous one.

Bytes: 41 C3 A9 E4 B8 AD

41 = 01000001 ← starts with 0 → single byte → "A"

C3 = 11000011 ← two 1s → read 2 bytes
A9 = 10101001 ← continuation
→ "Γ©"

E4 = 11100100 ← three 1s → read 3 bytes
B8 = 10111000 ← continuation
AD = 10101101 ← continuation
→ "δΈ­"

The boundaries are now clear: [41] [C3 A9] [E4 B8 AD] → "A", "Γ©", "δΈ­". Each byte's leading bits told us exactly what to do, with no guessing required.
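
Python's built-in decoder applies exactly these rules, so you can check the result directly:

    data = bytes.fromhex("41 C3 A9 E4 B8 AD")
    text = data.decode("utf-8")
    print(text)                                            # AΓ©δΈ­
    print(len(data), "bytes ->", len(text), "characters")  # 6 bytes -> 3 characters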

3.5 Encoding

Encoding works in the opposite direction. Given a Unicode code point, we determine how many bytes it needs, then split its bits into the appropriate UTF-8 template.

Let's encode Γ© (U+00E9, decimal 233). First, we determine how many bytes it needs by checking its value: 233 falls between 128 and 2047, so it requires the 2-byte template.

Step 1: Convert code point to binary

233 in decimal = 11101001 in binary (8 bits)

Step 2: Count the available data slots

Each byte reserves some bits for the prefix, leaving the rest as "slots" for our actual data:

Byte 1: 110xxxxx → 5 data slots
Byte 2: 10xxxxxx → 6 data slots
Total: 5 + 6 = 11 data slots to fill

Step 3: Pad the binary to fill all slots

Our binary 11101001 has only 8 bits, so we pad it with leading zeros to fill all 11 slots:

11101001 (8 bits) → 00011101001 (11 bits)

Step 4: Split and add prefix bits

We split from the right to fill each byte's slots, then prepend the prefix bits:

Split from right: 00011 (5 bits) | 101001 (6 bits)
Byte 1: 110 + 00011 = 11000011 = 195 (0xC3)
Byte 2: 10 + 101001 = 10101001 = 169 (0xA9)

Result

"Γ©" → [195 (0xC3), 169 (0xA9)]

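The same arithmetic as a short sketch, checked against Python's built-in encoder. The helper encode_2byte is illustrative and only handles the 2-byte range (U+0080 to U+07FF):

    def encode_2byte(code_point: int) -> bytes:
        """UTF-8 encode a code point in the 2-byte range (U+0080 to U+07FF)."""
        assert 0x80 <= code_point <= 0x7FF
        byte1 = 0b11000000 | (code_point >> 6)         # 110xxxxx: top 5 data bits
        byte2 = 0b10000000 | (code_point & 0b111111)   # 10xxxxxx: low 6 data bits
        return bytes([byte1, byte2])

    print(list(encode_2byte(0x00E9)))   # [195, 169]
    print(list("Γ©".encode("utf-8")))    # [195, 169]
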
4. Complete Example

Let's trace through how "Hi πŸ‘‹" gets converted to bytes.

Step 1: Look up each character's code point

  • H: U+0048 (decimal 72), 1 byte
  • i: U+0069 (decimal 105), 1 byte
  • (space): U+0020 (decimal 32), 1 byte
  • πŸ‘‹: U+1F44B (decimal 128075), 4 bytes

Step 2: Encode to bytes

The first three characters have code points below 128, so they map directly to single bytes: 72, 105, 32. The emoji (code point 128075) requires 4-byte encoding. Following the same process from section 3.5, we use the 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx template, producing: 240, 159, 145, 139.

Step 3: The final byte sequence

72 (0x48), 105 (0x69), 32 (0x20), 240 (0xF0), 159 (0x9F), 145 (0x91), 139 (0x8B)

Our four characters became seven bytes. You can verify this with list("Hi πŸ‘‹".encode('utf-8')) in Python.
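
Running that check:

    data = "Hi πŸ‘‹".encode("utf-8")
    print(list(data))                                         # [72, 105, 32, 240, 159, 145, 139]
    print(len("Hi πŸ‘‹"), "characters ->", len(data), "bytes")  # 4 characters -> 7 bytes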

5. Interactive: Character to Bytes

[Interactive demo: type any text to see each character mapped to its UTF-8 bytes]

6. Why Not Train on Raw Bytes?

We've now seen how UTF-8 converts any text into a sequence of bytes. Every language, every emoji, every symbol reduces to just 256 possible byte values. So why not feed these raw bytes directly into an AI model?

The Problem with Bytes

The issue isn't what the model can read, but how much attention it has to spend reading it. Imagine someone telling you a story by pronouncing every letter:

"T... H... E... Q... U... I... C... K... B... R... O... W... N... F... O... X..."

By the time they reach "F-O-X," you've forgotten the beginning. Your brain is so busy assembling letters that it has no capacity left for meaning.

AI models face a similar challenge. They process text sequentially, one unit at a time, and can only hold so many units in memory. This limit is called the Context Window. When each byte is a separate unit, sequences become very long. For example, "Artificial Intelligence" is just 2 words to us, but 23 separate bytes for the model. Most of its capacity gets spent on low-level details instead of meaning.
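
The numbers are easy to verify:

    text = "Artificial Intelligence"
    print(len(text.split()), "words")          # 2 words
    print(len(text.encode("utf-8")), "bytes")  # 23 bytes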

What About Words?

The opposite extreme is to give every word its own ID: "Apple" = 101, "Banana" = 102. This keeps sequences short, but creates new problems. First, the vocabulary becomes huge. To cover all languages, names, and technical terms, we'd need millions of IDs, consuming massive amounts of memory during training. Second, the vocabulary is rigid. When a new term like "GPT-8" is invented, the model has no ID for it and fails to process it.

The Solution: Tokens

So bytes give us sequences that are too long, and words give us vocabularies that are too large. The middle ground is to start with bytes and merge adjacent ones based on frequency. For instance, the bytes for 't', 'h', and 'e' appear together so often that it's efficient to merge them into a single unit, called a token. The same goes for common words like "apple" and "run", and for suffixes like "ing". Rare words like "cryptocurrency" get split into familiar pieces such as "crypto" + "currency".

This gives us the best of both worlds. Common text stays compact and efficient, while any new or rare word can be built from smaller, known pieces.

7. Next: Tokenization

We now have the right concept: common words should be single units, and rare words should be split. This process of chunking text into pieces is called Tokenization, and the resulting pieces are called tokens.

Preview: Tokenization in action
"Hello" → [Hello] (1 token)
"Unbelievable" → [Un][believ][able] (3 tokens)
Summary
  • Unicode assigns a unique number (code point) to every character
  • UTF-8 encodes those numbers efficiently into bytes
  • Raw bytes are too granular (wasting model attention)
  • Words are too sparse (huge vocabularies)
  • Tokens are the middle ground: frequent patterns merged into single units

In the next chapter, we'll build the algorithm that finds these optimal tokens: Byte Pair Encoding (BPE).