In the previous chapter, we built a Tokenizer. We can now take a sentence like "I love cats" and convert it into a list of integers, something like [40, 1842, 9246]. It feels like we are ready. Computers operate on numbers, so why can we not just feed these integers directly into the Neural Network?
To answer this, you have to understand how a Neural Network actually "thinks." It is a mathematical engine. It uses multiplication, addition, dot products, and gradients (calculus) to find patterns. Crucially, it assumes that Magnitude Matters.
1. The Problem with Raw Token IDs
Suppose we fed raw Token IDs directly into the model. Imagine the tokenizer assigned IDs like this: "Apple" → 100 and "Banana" → 500.
If you feed the number 100 and the number 500 into a neuron, the math implicitly assumes that the second input is "5 times greater" than the first.
This relationship is purely arithmetic. It has nothing to do with the relationship between an Apple and a Banana. These numbers are distinct labels, similar to Employee IDs in a company database. Employee #500 is not "5 times better" than Employee #100; they are just different people.
If we force the model to do math on these arbitrary IDs, we are asking it to find patterns in chaos. The model fails because of Magnitude Bias. In a neural network, a neuron applies the same weight (Output = Input × Weight) to its input, regardless of the value. If the weight is 0.01, an input of 100 (Apple) produces 1.0, but an input of 500 (Banana) produces 5.0. The math forces the model to treat Banana as "5 times more intense" than Apple. There is no single weight that can work for both numbers, so the model cannot learn a consistent rule for "Fruit".
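Here is that failure as a tiny sketch, in plain Python, using the same made-up IDs and weight as above:

```python
# A single neuron applies one shared weight to whatever number it receives.
# These token IDs are arbitrary labels, not quantities.
apple_id = 100   # "Apple"
banana_id = 500  # "Banana"

weight = 0.01  # the same learned weight is applied to every input

print(apple_id * weight)   # 1.0
print(banana_id * weight)  # 5.0  -> "Banana" is forced to look 5x more intense than "Apple"
```

No matter what value the weight takes, the ratio between the two outputs is fixed by the arbitrary IDs, so the neuron can never treat Apple and Banana as two equally valid members of "Fruit".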
We need a way to represent words where the numbers themselves actually contain the meaning.
2. Representing Meaning with Multiple Numbers
If a single number (an integer) fails to capture meaning, what if we used multiple numbers?
Consider a system where the numbers represent qualities of the object. To keep this intuitive, we will stick to two dimensions: Royalty and Gender.
Each axis is a question scored from -1.0 to +1.0:
- Royalty: "Is this about royalty?" (+1 strongly yes, -1 strongly the opposite, 0 unrelated)
- Gender: "Is this masculine or feminine?" (+1 masculine, -1 feminine, 0 neutral)
This is the breakthrough. By representing each word as a list of attributes (a Vector), we have encoded meaning into the numbers themselves:

| Word | Royalty | Gender |
|---|---|---|
| King | +1.0 | +1.0 |
| Queen | +1.0 | -1.0 |
| Man | 0.0 | +1.0 |
| Woman | 0.0 | -1.0 |
| Apple | 0.0 | 0.0 |

King and Queen share the same Royalty score, King and Man share the same Gender score, and Apple sits at zero for both because these human qualities simply do not apply to fruit.
3. Visualizing the Meaning Space
In our simple example we acted as human linguists and hand-picked clear categories like Royalty and Gender, but in real deep learning we do not define these labels beforehand. Instead, we give the model a fixed number of dimensions, known as the embedding size (often denoted d_model), which serves as a list of empty attribute slots waiting to be filled. During training, the model discovers its own abstract features and assigns them to these slots, capturing relationships like plurality, sentiment, or grammatical rules that humans might not even have names for.
But because we stuck to 2 dimensions for our example, we can plot these words on a standard X-Y graph.
Words are positioned based on their Royalty and Gender attributes.
In this 2D space, every word lands at a specific coordinate. We have turned words into Geometry.
Similarity = Closeness
Words with similar meanings cluster together. King and Queen both sit in the upper region (high Royalty). Man and King share the right side (positive Gender). Meanwhile, Apple sits alone at the origin. It has nothing in common with royalty or gender.
Region = Topic
The top half of the graph is the "Royal Region." The right half is the "Male Region." Words naturally organize into neighborhoods of related concepts.
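To make these regions concrete, here is a minimal sketch in plain Python using the hand-picked vectors from Section 2:

```python
# Hand-picked 2-D vectors from Section 2: (Royalty, Gender)
word_vectors = {
    "King":  (1.0,  1.0),
    "Queen": (1.0, -1.0),
    "Man":   (0.0,  1.0),
    "Woman": (0.0, -1.0),
    "Apple": (0.0,  0.0),
}

# "Royal Region": words with a positive Royalty score
royal_region = [word for word, (royalty, gender) in word_vectors.items() if royalty > 0]
# "Male Region": words with a positive Gender score
male_region = [word for word, (royalty, gender) in word_vectors.items() if gender > 0]

print(royal_region)  # ['King', 'Queen']
print(male_region)   # ['King', 'Man']
```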
4. Semantic Arithmetic: The King-Queen Analogy
Because we are now working with geometry, we can do something almost magical. We can perform arithmetic on meaning itself.
Both arrows point in the same direction, showing that "Gender" is a consistent direction in the space.
Look at the arrows on the map above. The arrow from King to Queen represents "Flipping Gender" while keeping Royalty constant. The arrow from Man to Woman represents the exact same transformation.
In a well-trained model, those two arrows are almost identical. The model has learned that the concept of "Gender" isn't just a label. It is a specific direction in space.
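With the hand-picked vectors from Section 2, you can check this directly in a tiny sketch:

```python
# (Royalty, Gender) vectors from Section 2
king, queen = (1.0, 1.0), (1.0, -1.0)
man, woman = (0.0, 1.0), (0.0, -1.0)

# The "flip gender" arrow for each pair: destination minus origin
king_to_queen = (queen[0] - king[0], queen[1] - king[1])
man_to_woman = (woman[0] - man[0], woman[1] - man[1])

print(king_to_queen)  # (0.0, -2.0)
print(man_to_woman)   # (0.0, -2.0): the exact same direction
```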
King - Man + Woman = ?
Let's plug in the numbers from Section 2:

King − Man + Woman = [1.0, +1.0] − [0.0, +1.0] + [0.0, −1.0] = [1.0, −1.0]
Look back at our definitions in Section 2. Which word has the vector [1.0, -1.0]?
By taking the concept of a King, removing the "Man-ness", and adding "Woman-ness", we arrive mechanically at the coordinates for Queen. This demonstrates how the embedding space captures semantic relationships, showing that the model has organized language into a consistent geometric map where meaning can be manipulated mathematically.
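The same arithmetic in a short sketch (plain Python, reusing the Section 2 vectors and picking the closest word to the result):

```python
# Hand-picked 2-D vectors from Section 2: (Royalty, Gender)
word_vectors = {
    "King":  (1.0,  1.0),
    "Queen": (1.0, -1.0),
    "Man":   (0.0,  1.0),
    "Woman": (0.0, -1.0),
    "Apple": (0.0,  0.0),
}

def add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1])

def nearest(target, vectors):
    """Return the word whose vector is closest (squared Euclidean distance) to `target`."""
    return min(vectors, key=lambda w: (vectors[w][0] - target[0]) ** 2
                                    + (vectors[w][1] - target[1]) ** 2)

result = add(sub(word_vectors["King"], word_vectors["Man"]), word_vectors["Woman"])
print(result)                         # (1.0, -1.0)
print(nearest(result, word_vectors))  # Queen
```

In real embedding spaces the result rarely lands exactly on a word, so the answer is taken to be the nearest vector, exactly as `nearest()` does here.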
5. Implementation: The Embedding Matrix
We know the model will learn these coordinates during training (remember those empty attribute slots from Section 3). But how exactly does this work in practice? Let's look at the actual data structure and the learning process.
The Embedding Layer: A Lookup Table
In code, an Embedding Layer is a table of numbers. Think of it like a spreadsheet: each row corresponds to a token ID, and each column is one of those attribute slots. When the model sees a token, it simply looks up the corresponding row.
| Token ID | Dim 1 | Dim 2 | Dim 3 | ... | Dim N |
|---|---|---|---|---|---|
| 0 ("the") | 0.12 | -0.45 | 0.78 | ... | 0.33 |
| 1 ("King") | 0.89 | 0.56 | -0.21 | ... | 0.67 |
| ... | ... | ... | ... | ... | ... |
When the model sees token ID 1, it simply looks up row 1 and grabs that entire row as the vector. Just a table lookup.
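If you are working in PyTorch, this table is `torch.nn.Embedding`. A minimal sketch follows; the sizes are just illustrative (roughly GPT-2-shaped), and the token IDs are the ones from our "I love cats" example in the opening paragraph:

```python
import torch
import torch.nn as nn

vocab_size = 50_000  # number of rows: one per token ID
embed_dim = 768      # number of columns: attribute slots per token

# A (50,000 x 768) table of learnable floats, randomly initialized
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([40, 1842, 9246])  # "I love cats" from the previous chapter
vectors = embedding(token_ids)              # looks up one row per token ID

print(embedding.weight.shape)  # torch.Size([50000, 768])
print(vectors.shape)           # torch.Size([3, 768]) -> one 768-dim vector per token
```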
The size of this table depends on two choices you make when designing your model:
Vocabulary Size: the number of rows (one per token)
- GPT-2: ~50,000 tokens
- GPT-4: ~100,000 tokens
- Llama 2: ~32,000 tokens
Embedding Dimension: the number of columns (attributes per token)
- GPT-2 Small: 768 dimensions
- GPT-3: 12,288 dimensions
- Llama 2 70B: 8,192 dimensions
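Multiplying the two numbers gives the size of the table. A quick back-of-the-envelope sketch using the approximate figures above:

```python
# Approximate embedding-table sizes: rows (vocabulary) x columns (dimensions)
models = {
    "GPT-2 Small": (50_000, 768),
    "Llama 2 70B": (32_000, 8_192),
}

for name, (vocab_size, embed_dim) in models.items():
    params = vocab_size * embed_dim
    print(f"{name}: {params:,} learnable numbers in the embedding table")

# GPT-2 Small: 38,400,000 learnable numbers in the embedding table
# Llama 2 70B: 262,144,000 learnable numbers in the embedding table
```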
The Training Process
When we first initialize the embedding table, every cell is filled with random numbers. "King" might start at coordinates right next to "Sandwich." The map has no structure at all.
Training is how the model figures out what each dimension should represent, and it happens end-to-end with the rest of the model. We don't build embeddings separately; the embedding table is the first learned layer that feeds into the transformer stack.
When the model tries to predict the next word and makes a mistake, the "correction signal" (gradients) flows backwards through the entire network, updating the transformer blocks and finally the embedding vectors themselves. Those updates gradually pull related tokens closer together in the coordinate space and push unrelated ones apart. Meaning gets baked into the embedding vectors at the same time the model learns how to use the words in context.
Consider one training example:
1. The training sentence contains "King" near "Throne."
2. Early in training, the model's prediction for the next word is essentially random.
3. The correct word is compared against the prediction, producing an error signal.
4. Because "King" appeared near "Throne," the model nudges "King" a little closer to "Throne" in its coordinate space, and a little further from "Banana."
This process repeats many times with different sentences. Every time "King" appears near "Queen," "Crown," or "Palace," the model nudges their coordinates closer. Every time "Apple" appears near "Orange" and "Banana," those fruit words cluster together.
After enough examples, the random noise transforms into the structured semantic space we visualized in Section 3. The model has discovered concepts like Royalty and Gender on its own, encoded them into its dimensions, and organized all the words accordingly.
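Here is a minimal PyTorch sketch of one such nudge, with made-up sizes and token IDs and a single fake training step rather than a real training loop:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embed_dim = 10, 4
embedding = nn.Embedding(vocab_size, embed_dim)  # starts as random noise
head = nn.Linear(embed_dim, vocab_size)          # stand-in for "the rest of the model"
optimizer = torch.optim.SGD(list(embedding.parameters()) + list(head.parameters()), lr=0.1)

context_id = torch.tensor([3])  # pretend token 3 is "King"
target_id = torch.tensor([7])   # pretend token 7 is "Throne", the word that actually came next

logits = head(embedding(context_id))           # predict the next token
loss = F.cross_entropy(logits, target_id)      # compare prediction against the correct word
loss.backward()                                # the correction signal flows backwards...

print(embedding.weight.grad[3])  # ...into row 3 of the table: only "King"'s row gets a gradient
optimizer.step()                 # the nudge: "King"'s coordinates move
```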
6. Why Word Order Gets Lost
We have solved the meaning problem. We can translate "King" into a rich vector that captures its essence.
Here's the issue. Traditional language models (like RNNs) read words one by one, left to right, so order is built in. But this sequential approach is painfully slow. You cannot process word #5 until you have finished words #1 through #4. This makes training on billions of sentences take forever.
Transformer architectures (used in modern LLMs) solve the speed problem by processing all words in parallel, reading the entire sentence at once. This is massively faster and more scalable. But it creates a new problem: if you hand the model all words simultaneously, how does it know which came first?
Think back to our preprocessing pipeline: Text → Bytes → Tokens → Vectors. At no point did we encode where each word appears. The token ID for "Alice" is the same whether she appears first, third, or last in a sentence. And the embedding vector we just learned to look up? It only captures what the word means, not where it sits.
This becomes a problem when words need to interact with each other. In a Transformer, every word looks at every other word simultaneously to build understanding. Let's visualize this with the sentence "Alice gave Bob a book":
In the grid below, compare(wordA, wordB) is a placeholder for the actual mechanism (which we'll cover in a later chapter). For now, just think of it as "wordA examines wordB."
| | Alice | gave | Bob | a | book |
|---|---|---|---|---|---|
| Alice | — | compare(Alice, gave) | compare(Alice, Bob) | compare(Alice, a) | compare(Alice, book) |
| gave | compare(gave, Alice) | — | compare(gave, Bob) | compare(gave, a) | compare(gave, book) |
| Bob | compare(Bob, Alice) | compare(Bob, gave) | — | compare(Bob, a) | compare(Bob, book) |
| a | compare(a, Alice) | compare(a, gave) | compare(a, Bob) | — | compare(a, book) |
| book | compare(book, Alice) | compare(book, gave) | compare(book, Bob) | compare(book, a) | — |
Each cell shows one word examining another. All comparisons happen simultaneously.
Here's the problem: compare(Alice, Bob) and compare(Bob, Alice) use the exact same vectors. The function only sees two meaning-vectors. It has no idea that in "Alice gave Bob," Alice comes before the verb (making her the giver) while Bob comes after (making him the receiver).
Swap the sentence to "Bob gave Alice a book" and every compare() call produces identical results. The model cannot distinguish the giver from the receiver because position was never encoded.
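Here is a small sketch of the issue (PyTorch again, with made-up token IDs): both sentences map to the same set of vectors, just in a different order, so anything that ignores order cannot tell them apart:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(10, 4)  # toy table: 10 tokens, 4 dimensions

# Made-up token IDs: Alice=1, gave=2, Bob=3, a=4, book=5
sentence_a = torch.tensor([1, 2, 3, 4, 5])  # "Alice gave Bob a book"
sentence_b = torch.tensor([3, 2, 1, 4, 5])  # "Bob gave Alice a book"

vectors_a = embedding(sentence_a)
vectors_b = embedding(sentence_b)

# Each word gets exactly the same vector in both sentences...
print(torch.equal(vectors_a[0], vectors_b[2]))  # True: "Alice" is identical either way
# ...so any order-blind combination of the vectors is identical too.
print(torch.allclose(vectors_a.sum(dim=0), vectors_b.sum(dim=0)))  # True
```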
We need a way to stamp each vector with its position. That's exactly what we'll tackle in the next chapter.
7. Summary
- Raw Token IDs cannot be used directly because the model would treat ID 500 as "5× more" than ID 100, which is meaningless
- Embeddings are vectors (lists of numbers) that encode meaning. Similar words cluster together in vector space
- The Embedding Matrix is learned during training: the model gradually nudges coordinates based on context
- Vector arithmetic works on concepts: King − Man + Woman ≈ Queen
- Parallel processing is fast and scalable but loses word order, and we've identified this as a critical missing piece
We can now represent what each token means, but not where it appears. "Alice gave Bob" and "Bob gave Alice" produce identical embeddings despite meaning opposite things. In the next chapter, we'll tackle Positional Encoding to solve this.