The Vector Space

How to represent tokens as vectors to capture semantic meaning.

In the previous chapter, we built a tokenizer that converts text into integers. A sentence like "I love cats" becomes [40, 1842, 11875]. These token IDs are just identifiers; they don't capture what words mean or how they relate to each other. This chapter introduces embeddings: vectors that can express meaning.

1. Why Token IDs Aren't Enough

Token IDs tell us which token we're looking at, but nothing about what it means. For example, the word "Queen" carries multiple aspects of meaning: royalty, gender, relationships to other words like "King" or "Car." A categorical label like 32466 can't capture any of that. It's just an identifier.

What we need is a representation with enough structure to express how words relate to each other. "King" and "Queen" are related. "Red" and "blue" belong to the same category. "Paris" connects to "France" the way "Tokyo" connects to "Japan." A single number can't express any of these relationships.

2. Representing Meaning with Multiple Numbers

What if instead of one number, we used a list of numbers to represent each token? Each number could capture a different aspect of meaning. Here's a simplified way to think about it.

To illustrate how vectors let us express meaning, let's design a simple two-dimensional system, one dimension for Royalty and one for Gender. We score each token on both, using a scale from -1.0 to +1.0. For Royalty, +1 means strongly royal, -1 means the opposite, and 0 means unrelated. For Gender, +1 means masculine, -1 means feminine, and 0 means neutral.

With these two dimensions, we can represent tokens as pairs of numbers:

Token      Royalty   Gender   Vector
🤴 King       1.0      1.0    [1.0, 1.0]
👸 Queen      1.0     -1.0    [1.0, -1.0]
🧔 Man        0.0      1.0    [0.0, 1.0]
🍎 Apple      0.0      0.0    [0.0, 0.0]

Now we can express meaning and relationships through numbers. King and Queen share the same Royalty score. King and Man share the same Gender score. Apple sits at zero for both, since royalty and gender don't apply to fruit.
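To make this concrete, here is a minimal Python sketch of the same hand-picked vectors (the dictionary name and index constants are just for illustration). It checks the relationships described above: King and Queen share a Royalty score, and King and Man share a Gender score.

```python
# Hand-picked 2D "embeddings": [Royalty, Gender]
toy_embeddings = {
    "King":  [1.0,  1.0],
    "Queen": [1.0, -1.0],
    "Man":   [0.0,  1.0],
    "Apple": [0.0,  0.0],
}

ROYALTY, GENDER = 0, 1

# King and Queen share the same Royalty score ...
assert toy_embeddings["King"][ROYALTY] == toy_embeddings["Queen"][ROYALTY]
# ... King and Man share the same Gender score ...
assert toy_embeddings["King"][GENDER] == toy_embeddings["Man"][GENDER]
# ... and Apple is neutral on both dimensions.
assert toy_embeddings["Apple"] == [0.0, 0.0]
```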

3. Visualizing the Vector Space

Since we used only two dimensions, we can plot these tokens on a 2D graph with Royalty on the vertical axis and Gender on the horizontal.

[Figure: 2D Vector Space — Gender on the horizontal axis, Royalty on the vertical axis, with King at [1, 1], Queen at [1, -1], Man at [0, 1], Woman at [0, -1], and Apple at [0, 0].]

Tokens are positioned based on their Royalty and Gender scores.

Each token now sits at a specific coordinate. The space organizes tokens by meaning: King and Queen sit at the top (high royalty), while King and Man sit on the right (masculine). Apple sits at the origin because neither dimension applies to it.

Because each dimension represents an attribute, the direction you travel from one token to another captures their relationship. Notice the arrows in the graph below.

[Figure: Vector Arithmetic — the same 2D space, with a "Gender Flip" arrow from King [1, 1] to Queen [1, -1] and a parallel arrow from Man [0, 1] to Woman [0, -1]. Same direction = same concept.]

Both arrows point in the same direction, showing that "Gender" is a consistent direction in the space.

The arrow from King to Queen represents "flip gender while keeping royalty the same." The arrow from Man to Woman represents the same transformation. Both point in the same direction because both represent the same concept.

This geometry allows us to do arithmetic on meaning. If we take the concept of King, subtract the Man component, and add Woman, we should logically discover the female equivalent of royalty.

    King:      [1.0,  1.0]
  − Man:       [0.0,  1.0]
  + Woman:     [0.0, -1.0]
  = Result:    [1.0, -1.0]

The result is exactly the vector for Queen. By expanding from a single number to a vector, we can express both the meaning of each token and the relationships between them.
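The same arithmetic can be written directly in code. Here is a minimal NumPy sketch using the hand-picked vectors from above (the variable names are illustrative):

```python
import numpy as np

# Hand-picked 2D vectors: [Royalty, Gender]
king  = np.array([1.0,  1.0])
man   = np.array([0.0,  1.0])
woman = np.array([0.0, -1.0])
queen = np.array([1.0, -1.0])

# "King, minus the Man component, plus Woman" lands exactly on Queen.
result = king - man + woman
print(result)                          # [ 1. -1.]
print(np.array_equal(result, queen))   # True
```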

4. The Embedding Layer

The vectors we have been building are called embeddings, and the component that stores them is called the embedding layer. In the previous sections, we hand-picked two dimensions and manually assigned values. In practice, we only decide how many dimensions to use and initialize all values randomly. The model learns the rest during training.

The embedding layer is implemented as a 2D array where each row stores the vector for one token. If our vocabulary has V tokens and we choose d_model dimensions, the array has shape [V, d_model]. To convert a token ID into its embedding, the layer looks up that row:

Token ID      Dim 1   Dim 2   Dim 3   ...   Dim d_model
0 ("the")      0.12   -0.45    0.78   ...    0.33
1 ("King")     0.89    0.56   -0.21   ...    0.67
...             ...     ...     ...   ...     ...
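In code, the lookup is just row indexing into a randomly initialized matrix. Here is a minimal NumPy sketch (the sizes are hypothetical; the token IDs are the ones from the tokenizer chapter):

```python
import numpy as np

V, d_model = 50_000, 768                              # hypothetical vocabulary size and embedding dimension
embedding_table = np.random.randn(V, d_model) * 0.02  # random initialization; training refines it

token_ids = np.array([40, 1842, 11875])               # "I love cats"
embeddings = embedding_table[token_ids]               # one row looked up per token

print(embeddings.shape)                               # (3, 768)
```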

Two numbers define the embedding layer: the vocabulary size V (number of rows) and the embedding dimension d_model (number of columns). A larger vocabulary means more tokens can exist as single entries, so sentences become shorter sequences, but it adds more rows and increases memory.

A larger embedding dimension gives the model more capacity to represent meaning, but it adds more columns to every row and increases both memory and computation. In practice, vocabulary sizes range from around 32,000 to 100,000 tokens, while embedding dimensions range from a few hundred to several thousand depending on the model's scale.
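For a rough sense of scale: a hypothetical table with V = 50,000 and d_model = 768 already holds 50,000 × 768 ≈ 38.4 million learned values, before counting any other layer of the model.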

When training begins, every value in this table is initialized randomly. The meaningful structure emerges during training.

During training, the model reads billions of sentences and tries to predict the next token. When it gets a prediction wrong, the error signal flows backward through the network and nudges the embedding values.

After enough examples, tokens that appear in similar contexts develop similar vectors, and the organized structure we visualized earlier emerges. Unlike our hand-picked Royalty and Gender dimensions, the learned dimensions are abstract and often uninterpretable, but they capture whatever helps the model predict well.
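To illustrate how that nudging happens, here is a minimal, self-contained PyTorch sketch. The tiny sizes, the token IDs, and the single linear output layer are stand-ins for illustration, not the real architecture: one wrong prediction produces a gradient that shifts the embedding row of the input token.

```python
import torch
import torch.nn as nn

# Made-up sizes for illustration: 10-token vocabulary, 4-dimensional embeddings.
vocab_size, d_model = 10, 4
embedding = nn.Embedding(vocab_size, d_model)       # rows start out random
next_token_head = nn.Linear(d_model, vocab_size)    # stand-in for the rest of the network

params = list(embedding.parameters()) + list(next_token_head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

# One made-up training example: after token 3, the next token should be 7.
context = torch.tensor([3])
target = torch.tensor([7])

row_before = embedding.weight[3].detach().clone()

logits = next_token_head(embedding(context))        # look up row 3, predict the next token
loss = nn.functional.cross_entropy(logits, target)  # how wrong was the prediction?

optimizer.zero_grad()
loss.backward()                                     # error signal flows back into row 3
optimizer.step()                                    # ...and nudges its values

print((embedding.weight[3].detach() - row_before).abs().max())  # non-zero: the row moved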

5. What Embeddings Don't Capture

We now have a way to represent what each token means. The embedding layer converts token IDs into vectors where similar tokens cluster together and relationships become geometric directions.

But embeddings do not capture position. The embedding layer is just a lookup table, so the vector for "Alice" is the same whether she appears first, third, or last in a sentence.

This matters because Transformers process all tokens in parallel, seeing the entire sentence at once. Without position information, "Alice helped Bob" and "Bob helped Alice" produce identical sets of embeddings. The next chapter introduces positional encoding to address this.
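Here is a quick sketch of that limitation, reusing a random lookup table like the one above (the tiny vocabulary and token IDs are made up): the row returned for a token is identical no matter where the token appears in the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((10, 4))       # tiny hypothetical table: 10 tokens, 4 dims

alice, helped, bob = 3, 5, 7                         # made-up token IDs
sentence_a = embedding_table[[alice, helped, bob]]   # "Alice helped Bob"
sentence_b = embedding_table[[bob, helped, alice]]   # "Bob helped Alice"

# The lookup ignores position: "Alice" gets the same vector in both sentences.
print(np.array_equal(sentence_a[0], sentence_b[2]))  # True
```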

Summary
  • Token IDs are just labels; a single number can't capture meaning
  • Embeddings represent tokens as vectors where each dimension captures an aspect of meaning
  • Similar tokens cluster together in vector space, and relationships become directions (King − Man + Woman ≈ Queen)
  • The embedding layer is a learned lookup table with shape [V, d_model] that gets refined during training
  • Embeddings don't capture position; a token has the same embedding regardless of where it appears