In the previous chapter, we built a Tokenizer. We can now take a sentence like "I love cats" and convert it into a list of integers, something like [40, 1842, 9246]. It feels like we are ready. Computers operate on numbers, so why can we not just feed these integers directly into the Neural Network?
To answer this, you have to understand how a Neural Network actually "thinks." It is a mathematical engine. It uses multiplication, addition, dot products, and gradients (calculus) to find patterns. Crucially, it assumes that Magnitude Matters.
1. The Problem with Raw Token IDs
Suppose we fed raw Token IDs directly into the model. Imagine the tokenizer assigned IDs like this: "Apple" → 100 and "Banana" → 500.
If you feed the number 100 and the number 500 into a neuron, the math implicitly assumes that the second input is "5 times greater" than the first.
This relationship is purely arithmetic. It has nothing to do with the relationship between an Apple and a Banana. These numbers are distinct labels, similar to Employee IDs in a company database. Employee #500 is not "5 times better" than Employee #100; they are just different people.
If we force the model to do math on these arbitrary IDs, we are asking it to find patterns in chaos. The model fails because of Magnitude Bias. In a neural network, a neuron applies the same weight (Output = Input × Weight) to its input, regardless of the value. If the weight is 0.01, an input of 100 (Apple) produces 1.0, but an input of 500 (Banana) produces 5.0. The math forces the model to treat Banana as "5 times more intense" than Apple. There is no single weight that can work for both numbers, so the model cannot learn a consistent rule for "Fruit".
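Here is that failure as a tiny sketch, in plain Python, using the same made-up IDs and weight as above:

```python
# A single neuron applies one shared weight to whatever number it receives.
# These token IDs are arbitrary labels, not quantities.
apple_id = 100   # "Apple"
banana_id = 500  # "Banana"

weight = 0.01  # the same learned weight is applied to every input

print(apple_id * weight)   # 1.0
print(banana_id * weight)  # 5.0  -> "Banana" is forced to look 5x more intense than "Apple"
```

No matter what value the weight takes, the ratio between the two outputs is fixed by the arbitrary IDs, so the neuron can never treat Apple and Banana as two equally valid members of "Fruit".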
We need a way to represent words where the numbers themselves actually contain the meaning.
2. Representing Meaning with Multiple Numbers
If a single number (an integer) fails to capture meaning, what if we used multiple numbers?
Consider a system where the numbers represent qualities of the object. To keep this intuitive, we will stick to two dimensions: Royalty and Gender.
Each axis is a question scored from -1.0 to +1.0:
- Royalty: "Is this about royalty?" (+1 strongly yes, -1 strongly the opposite, 0 unrelated)
- Gender: "Is this masculine or feminine?" (+1 masculine, -1 feminine, 0 neutral)
This is the breakthrough. By representing each word as a list of attributes (a Vector), we have encoded meaning into the numbers themselves:

| Word | Royalty | Gender |
|---|---|---|
| King | +1.0 | +1.0 |
| Queen | +1.0 | -1.0 |
| Man | 0.0 | +1.0 |
| Woman | 0.0 | -1.0 |
| Apple | 0.0 | 0.0 |

King and Queen share the same Royalty score, King and Man share the same Gender score, and Apple sits at zero for both because these human qualities simply do not apply to fruit.
3. Visualizing the Meaning Space
In our simple example we acted as human linguists and hand-picked clear categories like Royalty and Gender, but in real deep learning we do not define these labels beforehand. Instead, we give the model a fixed number of dimensions, known as the embedding size (often denoted d_model), which serves as a list of empty attribute slots waiting to be filled. During training, the model discovers its own abstract features and assigns them to these slots, capturing relationships like plurality, sentiment, or grammatical rules that humans might not even have names for.
But because we stuck to 2 dimensions for our example, we can plot these words on a standard X-Y graph.
Words are positioned based on their Royalty and Gender attributes.
In this 2D space, every word lands at a specific coordinate. We have turned words into Geometry.
Similarity = Closeness
Words with similar meanings cluster together. King and Queen both sit in the upper region (high Royalty). Man and King share the right side (positive Gender). Meanwhile, Apple sits alone at the origin. It has nothing in common with royalty or gender.
Region = Topic
The top half of the graph is the "Royal Region." The right half is the "Male Region." Words naturally organize into neighborhoods of related concepts.
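To make these regions concrete, here is a minimal sketch in plain Python using the hand-picked vectors from Section 2:

```python
# Hand-picked 2-D vectors from Section 2: (Royalty, Gender)
word_vectors = {
    "King":  (1.0,  1.0),
    "Queen": (1.0, -1.0),
    "Man":   (0.0,  1.0),
    "Woman": (0.0, -1.0),
    "Apple": (0.0,  0.0),
}

# "Royal Region": words with a positive Royalty score
royal_region = [word for word, (royalty, gender) in word_vectors.items() if royalty > 0]
# "Male Region": words with a positive Gender score
male_region = [word for word, (royalty, gender) in word_vectors.items() if gender > 0]

print(royal_region)  # ['King', 'Queen']
print(male_region)   # ['King', 'Man']
```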
4. Semantic Arithmetic: The King-Queen Analogy
Because we are now working with geometry, we can do something almost magical. We can perform arithmetic on meaning itself.
Both arrows point in the same direction, showing that "Gender" is a consistent direction in the space.
Look at the arrows on the map above. The arrow from King to Queen represents "Flipping Gender" while keeping Royalty constant. The arrow from Man to Woman represents the exact same transformation.
In a well-trained model, those two arrows are almost identical. The model has learned that the concept of "Gender" isn't just a label. It is a specific direction in space.
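With the hand-picked vectors from Section 2, you can check this directly in a tiny sketch:

```python
# (Royalty, Gender) vectors from Section 2
king, queen = (1.0, 1.0), (1.0, -1.0)
man, woman = (0.0, 1.0), (0.0, -1.0)

# The "flip gender" arrow for each pair: destination minus origin
king_to_queen = (queen[0] - king[0], queen[1] - king[1])
man_to_woman = (woman[0] - man[0], woman[1] - man[1])

print(king_to_queen)  # (0.0, -2.0)
print(man_to_woman)   # (0.0, -2.0): the exact same direction
```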
King - Man + Woman = ?
Let's plug in the numbers from Section 2:

King − Man + Woman = [1.0, +1.0] − [0.0, +1.0] + [0.0, −1.0] = [1.0, −1.0]
Look back at our definitions in Section 2. Which word has the vector [1.0, -1.0]?
By taking the concept of a King, removing the "Man-ness", and adding "Woman-ness", we arrive mechanically at the coordinates for Queen. This demonstrates how the embedding space captures semantic relationships, showing that the model has organized language into a consistent geometric map where meaning can be manipulated mathematically.
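The same arithmetic in a short sketch (plain Python, reusing the Section 2 vectors and picking the closest word to the result):

```python
# Hand-picked 2-D vectors from Section 2: (Royalty, Gender)
word_vectors = {
    "King":  (1.0,  1.0),
    "Queen": (1.0, -1.0),
    "Man":   (0.0,  1.0),
    "Woman": (0.0, -1.0),
    "Apple": (0.0,  0.0),
}

def add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1])

def nearest(target, vectors):
    """Return the word whose vector is closest (squared Euclidean distance) to `target`."""
    return min(vectors, key=lambda w: (vectors[w][0] - target[0]) ** 2
                                    + (vectors[w][1] - target[1]) ** 2)

result = add(sub(word_vectors["King"], word_vectors["Man"]), word_vectors["Woman"])
print(result)                         # (1.0, -1.0)
print(nearest(result, word_vectors))  # Queen
```

In real embedding spaces the result rarely lands exactly on a word, so the answer is taken to be the nearest vector, exactly as `nearest()` does here.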
5. Implementation: The Embedding Matrix
We know the model will learn these coordinates during training (remember those empty attribute slots from Section 3). But how exactly does this work in practice? Let's look at the actual data structure and the learning process.
The Embedding Layer: A Lookup Table
In code, an Embedding Layer is a table of numbers. Think of it like a spreadsheet: each row corresponds to a token ID, and each column is one of those attribute slots. When the model sees a token, it simply looks up the corresponding row.
| Token ID | Dim 1 | Dim 2 | Dim 3 | ... | Dim N |
|---|---|---|---|---|---|
| 0 ("the") | 0.12 | -0.45 | 0.78 | ... | 0.33 |
| 1 ("King") | 0.89 | 0.56 | -0.21 | ... | 0.67 |
| ... | ... | ... | ... | ... | ... |
When the model sees token ID 1, it simply looks up row 1 and grabs that entire row as the vector. Just a table lookup.
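If you are working in PyTorch, this table is `torch.nn.Embedding`. A minimal sketch follows; the sizes are just illustrative (roughly GPT-2-shaped), and the token IDs are the ones from our "I love cats" example in the opening paragraph:

```python
import torch
import torch.nn as nn

vocab_size = 50_000  # number of rows: one per token ID
embed_dim = 768      # number of columns: attribute slots per token

# A (50,000 x 768) table of learnable floats, randomly initialized
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([40, 1842, 9246])  # "I love cats" from the previous chapter
vectors = embedding(token_ids)              # looks up one row per token ID

print(embedding.weight.shape)  # torch.Size([50000, 768])
print(vectors.shape)           # torch.Size([3, 768]) -> one 768-dim vector per token
```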
The size of this table depends on two choices you make when designing your model:
Vocabulary Size: the number of rows (one per token)
- GPT-2: ~50,000 tokens
- GPT-4: ~100,000 tokens
- Llama 2: ~32,000 tokens
Embedding Dimension: the number of columns (attributes per token)
- GPT-2 Small: 768 dimensions
- GPT-3: 12,288 dimensions
- Llama 2 70B: 8,192 dimensions
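Multiplying the two numbers gives the size of the table. A quick back-of-the-envelope sketch using the approximate figures above:

```python
# Approximate embedding-table sizes: rows (vocabulary) x columns (dimensions)
models = {
    "GPT-2 Small": (50_000, 768),
    "Llama 2 70B": (32_000, 8_192),
}

for name, (vocab_size, embed_dim) in models.items():
    params = vocab_size * embed_dim
    print(f"{name}: {params:,} learnable numbers in the embedding table")

# GPT-2 Small: 38,400,000 learnable numbers in the embedding table
# Llama 2 70B: 262,144,000 learnable numbers in the embedding table
```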
The Training Process
When we first initialize the embedding table, every cell is filled with random numbers. "King" might start at coordinates right next to "Sandwich." The map has no structure at all.
Training is how the model figures out what each dimension should represent, and it happens end-to-end with the rest of the model. We don't build embeddings separately; the embedding table is the first learned layer that feeds into the transformer stack.
When the model tries to predict the next word and makes a mistake, the "correction signal" (gradients) flows backwards through the entire network, updating the transformer blocks and finally the embedding vectors themselves. Those updates gradually pull related tokens closer together in the coordinate space and push unrelated ones apart. Meaning gets baked into the embedding vectors at the same time the model learns how to use the words in context.
Consider one training example:
1. The training sentence contains "King" near "Throne."
2. Early in training, the model's prediction for the next word is essentially random.
3. The correct word is compared against the prediction, producing an error signal.
4. Because "King" appeared near "Throne," the model nudges "King" a little closer to "Throne" in its coordinate space, and a little further from "Banana."
This process repeats many times with different sentences. Every time "King" appears near "Queen," "Crown," or "Palace," the model nudges their coordinates closer. Every time "Apple" appears near "Orange" and "Banana," those fruit words cluster together.
After enough examples, the random noise transforms into the structured semantic space we visualized in Section 3. The model has discovered concepts like Royalty and Gender on its own, encoded them into its dimensions, and organized all the words accordingly.
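Here is a minimal PyTorch sketch of one such nudge, with made-up sizes and token IDs and a single fake training step rather than a real training loop:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embed_dim = 10, 4
embedding = nn.Embedding(vocab_size, embed_dim)  # starts as random noise
head = nn.Linear(embed_dim, vocab_size)          # stand-in for "the rest of the model"
optimizer = torch.optim.SGD(list(embedding.parameters()) + list(head.parameters()), lr=0.1)

context_id = torch.tensor([3])  # pretend token 3 is "King"
target_id = torch.tensor([7])   # pretend token 7 is "Throne", the word that actually came next

logits = head(embedding(context_id))           # predict the next token
loss = F.cross_entropy(logits, target_id)      # compare prediction against the correct word
loss.backward()                                # the correction signal flows backwards...

print(embedding.weight.grad[3])  # ...into row 3 of the table: only "King"'s row gets a gradient
optimizer.step()                 # the nudge: "King"'s coordinates move
```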
6. Why Word Order Gets Lost
We have solved the meaning problem. We can translate "King" into a rich vector that captures its essence.
Here's the issue. Traditional language models (like RNNs) read words one by one, left to right, so order is built in. But this sequential approach is painfully slow. You cannot process word #5 until you have finished words #1 through #4. This makes training on billions of sentences take forever.
Transformer architectures (used in modern LLMs) solve the speed problem by processing all words in parallel, reading the entire sentence at once. This is massively faster and more scalable. But it creates a new problem: if you hand the model all words simultaneously, how does it know which came first?
Think back to our preprocessing pipeline: Text → Bytes → Tokens → Vectors. At no point did we encode where each word appears. The token ID for "Alice" is the same whether she appears first, third, or last in a sentence. And the embedding vector we just learned to look up? It only captures what the word means, not where it sits.
This becomes a problem when words need to interact with each other. In a Transformer, every word looks at every other word simultaneously to build understanding. Let's visualize this with the sentence "Alice gave Bob a book":
In the grid below, compare(wordA, wordB) is a placeholder for the actual mechanism (which we'll cover in a later chapter). For now, just think of it as "wordA examines wordB."
| | Alice | gave | Bob | a | book |
|---|---|---|---|---|---|
| Alice | — | compare(Alice, gave) | compare(Alice, Bob) | compare(Alice, a) | compare(Alice, book) |
| gave | compare(gave, Alice) | — | compare(gave, Bob) | compare(gave, a) | compare(gave, book) |
| Bob | compare(Bob, Alice) | compare(Bob, gave) | — | compare(Bob, a) | compare(Bob, book) |
| a | compare(a, Alice) | compare(a, gave) | compare(a, Bob) | — | compare(a, book) |
| book | compare(book, Alice) | compare(book, gave) | compare(book, Bob) | compare(book, a) | — |
Each cell shows one word examining another. All comparisons happen simultaneously.
Here's the problem: compare(Alice, Bob) and compare(Bob, Alice) use the exact same vectors. The function only sees two meaning-vectors. It has no idea that in "Alice gave Bob," Alice comes before the verb (making her the giver) while Bob comes after (making him the receiver).
Swap the sentence to "Bob gave Alice a book" and every compare() call produces identical results. The model cannot distinguish the giver from the receiver because position was never encoded.
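Here is a small sketch of the issue (PyTorch again, with made-up token IDs): both sentences map to the same set of vectors, just in a different order, so anything that ignores order cannot tell them apart:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(10, 4)  # toy table: 10 tokens, 4 dimensions

# Made-up token IDs: Alice=1, gave=2, Bob=3, a=4, book=5
sentence_a = torch.tensor([1, 2, 3, 4, 5])  # "Alice gave Bob a book"
sentence_b = torch.tensor([3, 2, 1, 4, 5])  # "Bob gave Alice a book"

vectors_a = embedding(sentence_a)
vectors_b = embedding(sentence_b)

# Each word gets exactly the same vector in both sentences...
print(torch.equal(vectors_a[0], vectors_b[2]))  # True: "Alice" is identical either way
# ...so any order-blind combination of the vectors is identical too.
print(torch.allclose(vectors_a.sum(dim=0), vectors_b.sum(dim=0)))  # True
```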
We need a way to stamp each vector with its position. That's exactly what we'll tackle in the next chapter.
7. Summary
- Raw Token IDs cannot be used directly because the model would treat ID 500 as "5× more" than ID 100, which is meaningless
- Embeddings are vectors (lists of numbers) that encode meaning. Similar words cluster together in vector space
- The Embedding Matrix is learned during training: the model gradually nudges coordinates based on context
- Vector arithmetic works on concepts: King − Man + Woman ≈ Queen
- Parallel processing is fast and scalable but loses word order, and we've identified this as a critical missing piece
We can now represent what each token means, but not where it appears. "Alice gave Bob" and "Bob gave Alice" produce identical embeddings despite meaning opposite things. In the next chapter, we'll tackle Positional Encoding to solve this.