Ch. 1: Words to Numbers | Demystifying AI

AI models can't read. Not really. Computers work with numbers, and language is made of words. Before a model can do anything with text, it has to figure out how to cross that gap.

What is a token, really?

There's a tendency to assume AI models work with words—reading them one at a time, the way you're reading this sentence. They don't. Instead, they break text into smaller pieces called tokens.

A token might be a whole word, part of a word, or even a single character. The word "understanding" could get split into "under" + "standing", or "un" + "derstand" + "ing". It depends on the model's vocabulary, the set of all tokens it knows.

Key idea: A token is not a word. Common words like "the" are single tokens, but unusual words get split into pieces. This is why AI sometimes struggles with spelling—it doesn't see individual letters, it sees chunks.

Why not just use whole words? Because there are too many of them. English alone has hundreds of thousands, and new ones appear constantly ("ChatGPT", "deflategate"). Instead, models use a vocabulary of roughly 50,000–100,000 tokens, built from common sub-word patterns that can be combined to represent any text.

See it for yourself

The explorer below uses the same tokenizer as GPT-3.5 and GPT-4 (called cl100k_base). Type something and watch how it gets broken apart. Each token gets a numeric ID—hover over one to see it. Notice how common words stay whole while unusual ones get split.

[Interactive: Live Tokenizer (cl100k_base, used by GPT-3.5 & GPT-4). Type or edit text and watch it get broken into tokens, with character count, token count, and characters-per-token ratio. Hover a token to see its numeric ID.]

Try typing something unusual—a technical term, a made-up word, something in another language. Watch the tokenizer break it into smaller pieces. Then try a common phrase like "the cat sat on the mat"—those everyday words each stay as single tokens.

How does it decide where to split?

If you've been playing with the tokenizer, you've noticed the splits can feel arbitrary. Why does one word break at a particular point and not another? Who decided "est" is a token but "esti" isn't?

The answer is an algorithm called Byte Pair Encoding (BPE), and it's surprisingly simple. BPE builds a vocabulary by starting with the smallest possible pieces (individual characters, or raw bytes in the byte-level variant that GPT tokenizers use) and then repeatedly merging the most common adjacent pair into a new token. That's it. The key insight is that it learns from a massive pile of training text, so the merges reflect actual language patterns.

The algorithm:

Start with characters. Every letter is its own token. The word "low" starts as l o w.
Count all adjacent pairs of tokens across the entire training text. Which pair appears most often? (In the first round every token is a single character, so these are two-character combinations.)
Merge the winner. That pair becomes a new token in the vocabulary.
Repeat thousands of times, until the vocabulary reaches the desired size (typically 50,000–100,000 tokens).
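
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind any production tokenizer: it works on characters rather than bytes, the function names (merge_pair, train_bpe, total_tokens) are made up for this example, and the tiny training list of "low", "lower", "newest", and "widest" is an assumption chosen to keep the merges easy to follow.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Return a copy of tokens with every adjacent occurrence of pair fused."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of training words."""
    # Step 1: every word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for tokens, freq in corpus.items():
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        # Step 3: merge the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for tokens, freq in corpus.items():
            new_corpus[merge_pair(tokens, best)] += freq
        corpus = new_corpus
        # Step 4: the loop repeats until the merge budget is used up.
    return merges, corpus


def total_tokens(corpus):
    """Number of tokens needed to write out the whole training text."""
    return sum(len(tokens) * freq for tokens, freq in corpus.items())


# Hypothetical toy training text.
training_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3

start_count = total_tokens(Counter(tuple(w) for w in training_words))
merges, corpus = train_bpe(training_words, num_merges=4)

print(merges)
print(start_count, "->", total_tokens(corpus))
```

On this toy text the first merges are "e" + "s", then "es" + "t", then "l" + "o", then "lo" + "w", and the total token count drops from 79 to 47. Real tokenizers run the same loop tens of thousands of times over far more text.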

Step through the algorithm below to watch it build a vocabulary. Pay attention to how common patterns merge first and how the total token count shrinks with each merge.

[Interactive: How BPE Builds a Vocabulary. Step through merges on a training text to see the most frequent pairs, the vocabulary (256 base tokens plus learned merges), the total tokens in the text, and the compression ratio.]

As you step through:

Common patterns merge first. Frequent pairs like "l" + "o" get merged early because they show up in many words.
The text gets shorter. Each merge reduces the total token count. That's compression—the vocabulary is learning efficient representations.
Whole words can emerge. After enough merges, common words like "low" become single tokens. Rare words stay split.
It's language-agnostic. BPE doesn't know English grammar. It finds statistical patterns, which is why the same procedure works for code, URLs, and other languages. It adapts to whatever text it trains on, so anything rare in the training text ends up split into more pieces.

Key idea: Humans don't design the vocabulary. It's discovered from data. BPE greedily builds an efficient set of reusable pieces for representing the training text. Common words become single tokens; rare words get assembled from common sub-word parts. That's why "the" is one token but "defenestration" is four.
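
Once the merges are learned, splitting a new word is just a matter of replaying them in order. The sketch below assumes a hypothetical four-rule merge list (the kind a toy training run on "low", "lower", "newest", and "widest" might produce); the function name encode and the test words are also made up for illustration.

```python
# Hypothetical merge rules, listed in the order they were learned.
MERGES = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]


def encode(word, merges):
    """Split a word into tokens by replaying learned merges in order."""
    tokens = list(word)  # start from single characters
    for pair in merges:
        merged = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


for word in ["low", "lowest", "slowly", "zebra"]:
    print(word, encode(word, MERGES))
```

With these rules, "low" comes out as a single token, "lowest" becomes "low" + "est", "slowly" becomes "s" + "low" + "l" + "y", and "zebra", which shares no learned pattern, stays as five separate characters.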

Hold on—isn't this just compression?

If you watched the token count drop as you stepped through the BPE visualizer, you might have had a thought: this is just compression. And you'd be right. BPE was originally invented by Philip Gage in 1994 as a data compression algorithm. The AI world borrowed it.

But if it's compression, why not use something better at it? Algorithms like Brotli or gzip produce much smaller output. The reason LLMs don't use them says a lot about what these models actually need.

Compression algorithms like Brotli optimize for size. They encode text using context-dependent references, essentially saying "this part is the same as something 4,000 bytes ago." The resulting "tokens" are opaque pointers that shift meaning depending on their surroundings. The same word might get encoded completely differently based on what came before it.

An LLM needs the opposite: a fixed, stable vocabulary where every piece has a permanent identity. The token for "ing" is always the same token, whether it appears in "running," "singing," or "tokenizing." That consistency matters because the model has to learn what each token means across billions of examples. If "ing" got a different representation every time, the model could never learn that it often signals a participle ("running"), a gerund ("swimming"), or any of the other patterns it appears in.
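
To get a feel for how context-dependent a general-purpose compressor is, here is a small experiment using zlib from Python's standard library (the same DEFLATE family as gzip). The phrase is an arbitrary example; the exact byte counts may vary between zlib versions, so the sketch only compares them.

```python
import zlib

phrase = b"The token for ing is always the same token. "

# Compressed size when the phrase appears once.
once = len(zlib.compress(phrase))
# Compressed size when the same phrase appears twice in a row.
twice = len(zlib.compress(phrase * 2))

# How many extra bytes the second copy cost.
extra_for_second_copy = twice - once

print("phrase length:", len(phrase))
print("compressed once:", once)
print("compressed twice:", twice)
print("extra bytes for the second copy:", extra_for_second_copy)
```

The second copy costs only a handful of bytes, because the compressor can in effect say "same as what came just before" instead of storing the words again. Identical text gets a completely different encoding depending on what preceded it, which is great for file size and useless as a stable vocabulary.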

There's a practical constraint too. When the model generates text, it picks one token at a time from a fixed menu of ~100,000 options. Compression algorithms don't give you a fixed menu. They give you an ever-shifting set of references that only make sense within a specific stream of data.

The tradeoff: BPE deliberately sacrifices compression efficiency for a stable, learnable vocabulary. Every token gets a permanent ID and a permanent meaning the model can learn over time. That's worth far more to a neural network than a few extra percentage points of compression.

From tokens to numbers

Each token in the vocabulary has a unique numeric ID. When you type "The cat sat on the mat," the model doesn't see words—it sees a sequence of numbers like [791, 8415, 7731, 389, 279, 5634].
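
In code, this step is nothing more than a table lookup. The sketch below uses the six IDs quoted above; attaching the space to the front of each following word is an assumption about how the pieces are split, and the variable names are made up for illustration.

```python
# Piece -> ID table, using the IDs quoted in the text.
# Assumption: each space is attached to the front of the word that follows it.
TOKEN_IDS = {
    "The": 791,
    " cat": 8415,
    " sat": 7731,
    " on": 389,
    " the": 279,
    " mat": 5634,
}
# The reverse table turns IDs back into text.
ID_TO_TOKEN = {token_id: piece for piece, token_id in TOKEN_IDS.items()}

pieces = ["The", " cat", " sat", " on", " the", " mat"]

ids = [TOKEN_IDS[piece] for piece in pieces]           # what the model receives
text = "".join(ID_TO_TOKEN[token_id] for token_id in ids)  # and back again

print(ids)
print(text)
```

The round trip works because every piece has exactly one ID and every ID has exactly one piece. Nothing about the number itself says anything about cats.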

But a bare number isn't enough. The ID 8415 for "cat" is just an arbitrary label—it doesn't carry any meaning. The model needs something richer, a representation that captures what a token actually means. This is where embeddings come in.

Embeddings: meaning as position in space

An embedding is a list of numbers, often hundreds or thousands of them, that represents a token's meaning. Think of those numbers as coordinates in a vast, high-dimensional space where words with similar meanings end up near each other. Relationships between words become directions and distances.
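
"Near each other" can be made concrete with a similarity score. One common choice is cosine similarity, which measures how closely two vectors point in the same direction. The three-number vectors below are invented for illustration; real embeddings are far longer and are learned, not written by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.1]
king = [0.1, 0.2, 0.9]

print("cat vs dog: ", round(cosine_similarity(cat, dog), 3))
print("cat vs king:", round(cosine_similarity(cat, king), 3))
```

With these made-up numbers, "cat" and "dog" score close to 1.0 while "cat" and "king" score much lower, which is the pattern real embeddings show for related and unrelated words.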

The visualization below plots words in two dimensions, positioned by meaning. Real embeddings have hundreds of dimensions—this is a simplified projection, but the relationships are real.

[Interactive: Words as Points in Space. Hover over a word to see its nearest neighbors.]

Notice:

Words cluster by meaning. Animals end up near animals, food near food, royalty near royalty.
Relationships are directions. Switch to "See vector arithmetic" and play the animation. The direction from "man" to "king" is roughly the same as the direction from "woman" to "queen": the model has learned a "royalty" direction in this space. (This relationship was first demonstrated in 2013 by Tomas Mikolov and colleagues.)
Nearness means similarity. Hover over "puppy"—its nearest neighbors are "dog" and "kitten," which makes intuitive sense.

Nobody told the model that "cat" and "dog" are both animals. These clusters emerge from how embeddings are learned. During training, the model sees "cat" and "dog" in similar contexts—"The ___ sat on the couch," "She fed the ___," "The ___ was sleeping." Because they substitute for each other in so many sentences, the training process nudges their embeddings closer together. Words that appear in similar contexts end up with similar coordinates.
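
The "similar contexts" idea can be tried out without any neural network at all. The sketch below is a deliberately crude stand-in: it counts which words appear near each target word and treats those counts as a vector. Real models learn their embeddings through training, not counting, and the six sentences and the function names are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical mini-corpus built from sentence patterns like the ones above.
SENTENCES = [
    "the cat sat on the couch",
    "the dog sat on the couch",
    "she fed the cat",
    "she fed the dog",
    "the king ruled the kingdom",
    "the queen ruled the kingdom",
]


def context_vectors(sentences, window=2):
    """For each word, count the words that appear within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


vectors = context_vectors(SENTENCES)
print("cat vs dog: ", round(cosine(vectors["cat"], vectors["dog"]), 3))
print("cat vs king:", round(cosine(vectors["cat"], vectors["king"]), 3))
```

"Cat" and "dog" come out nearly identical because they fill the same slots in the same sentences. "Cat" and "king" overlap only through filler words like "the," so they score lower. Nobody labeled anything as an animal.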

"King" and "queen" cluster together for the same reason—they share contexts like "The ___ ruled the kingdom." But "king" and "cat" rarely substitute for each other, so they end up far apart. We look at the clusters and say "those are the animals," but the model just knows they're words that show up in similar sentences.

Analogy: Imagine describing a person using only numbers: height, weight, age, extroversion on a 1-10 scale, musical ability on a 1-10 scale. With enough numbers, you could distinguish anyone from anyone else. Embeddings do the same thing for words—except with hundreds of dimensions, and the model figures out which "dimensions" matter on its own.

The full pipeline: raw text gets split into tokens, each token maps to a numeric ID, and each ID gets converted into an embedding vector. What comes out the other end is a sequence of numerical representations the model can actually work with.

[Figure: the full pipeline. Text is split into tokens, each mapped to an ID, each converted to a rich numerical vector.]
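
Here is the whole pipeline as one small sketch. Everything in it is a placeholder: the six-entry vocabulary is invented, the splitter uses simple longest-match lookup in place of real BPE, and the embedding table is filled with random numbers where a trained model would have learned values.

```python
import random

# Hypothetical vocabulary: piece -> ID.
VOCAB = {"the": 0, " the": 1, " cat": 2, " sat": 3, " on": 4, " mat": 5}

# One vector per vocabulary entry. Random placeholders, not learned values.
EMBED_DIM = 4
_rng = random.Random(0)
EMBEDDING_TABLE = [
    [_rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(len(VOCAB))
]


def tokenize(text):
    """Split text into known pieces, always taking the longest match."""
    pieces = []
    pos = 0
    while pos < len(text):
        matches = [piece for piece in VOCAB if text.startswith(piece, pos)]
        if not matches:
            raise ValueError(f"no token covers position {pos}")
        piece = max(matches, key=len)
        pieces.append(piece)
        pos += len(piece)
    return pieces


def text_to_vectors(text):
    """Run the three stages: text -> pieces -> IDs -> embedding vectors."""
    pieces = tokenize(text)
    ids = [VOCAB[piece] for piece in pieces]
    vectors = [EMBEDDING_TABLE[token_id] for token_id in ids]
    return pieces, ids, vectors


pieces, ids, vectors = text_to_vectors("the cat sat on the mat")
print(pieces)
print(ids)
print(len(vectors), "vectors of length", len(vectors[0]))
```

The model never sees the sentence. It sees six short lists of numbers, one per token, and the same token always brings the same list with it.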

The big picture: Tokenization is how AI crosses from human language to mathematics. Everything that follows—pattern learning, attention, prediction—operates on these numerical representations. The quality of this first step shapes everything the model can do.

Why this matters in practice

Tokenization explains a surprising number of AI quirks:

Why AI struggles with counting letters: It doesn't see letters, it sees token chunks. Ask it how many "r"s are in "strawberry" and it has to work with "str" + "aw" + "berry"—the individual letters are hidden inside the tokens.
Why longer prompts cost more: Hosted AI services typically charge per token, and more tokens also means more computation. Concise prompts are literally cheaper.
Why there are "context limits": Models can only handle a fixed number of tokens at once. We'll dig into this in Chapter 3.
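
Two of these quirks are easy to see in code. The split of "strawberry" is the one quoted above; the token IDs, the price, and the function name estimate_cost are all made up for illustration, so check a provider's actual pricing before relying on any number.

```python
# The pieces quoted in the text, with hypothetical IDs.
pieces = ["str", "aw", "berry"]
TOY_IDS = {"str": 1, "aw": 2, "berry": 3}

# What the model receives: three opaque numbers.
token_ids = [TOY_IDS[piece] for piece in pieces]

# Counting letters requires the characters, which the IDs do not expose.
letters_r = "".join(pieces).count("r")

print("model sees:", token_ids)
print("number of r's in the original characters:", letters_r)


def estimate_cost(num_tokens, price_per_million_tokens):
    """Estimate cost for a token count at a given (hypothetical) price."""
    return num_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical price of 2.00 per million tokens.
print("cost of 500,000 tokens:", estimate_cost(500_000, 2.0))
```

A program with access to the characters counts three "r"s instantly. The model, handed only the three IDs, has to have learned what letters each chunk contains.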

What the training data shapes

One question that comes up: if two models use the same tokenizer, why is one better at coding and another better at creative writing? Because tokenization is just the first step. The training data shapes everything that comes after.

Start with embeddings. A model trained on millions of lines of code will learn that function, return, and if cluster together, and that { and } have a tight relationship. A model trained mostly on novels develops very different clusters—"whispered" and "murmured" end up near each other, but it won't have strong embeddings for programming constructs it rarely encountered.

But embeddings are just the starting representation. The real "knowledge" about how code or language works gets encoded in the billions of parameters in the layers above, the attention patterns and prediction weights we'll get into in upcoming chapters.

A code-heavy training set means those layers get tuned through millions of code prediction exercises: predicting the next token in Python functions, JavaScript callbacks, and SQL queries. The model doesn't just learn that def and function are related tokens. It learns the patterns of how code flows.

This is why training data matters so much and why companies invest enormous resources curating it. The same architecture, trained on different text, produces a fundamentally different model. Two models can share a tokenizer, but the embeddings and everything the model learns to do with them depend entirely on what it practiced predicting.
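
A toy version of "same architecture, different training text" makes the point. The sketch below swaps the neural network for something far simpler, a table that counts which token tends to follow which, and splits on whitespace in place of a real tokenizer. Both training snippets and all function names are invented for this example.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the training text."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def predict_next(model, token):
    """Return the token most often seen after `token` during training."""
    return model[token].most_common(1)[0][0]


# Hypothetical training texts, split on whitespace for simplicity.
code_tokens = "if ( x ) return x ; if ( y ) return y ;".split()
prose_tokens = "she asked if he knew , and if he cared".split()

# Identical training procedure, different data.
code_model = train_bigram(code_tokens)
prose_model = train_bigram(prose_tokens)

print("code model, after 'if': ", predict_next(code_model, "if"))
print("prose model, after 'if':", predict_next(prose_model, "if"))
```

Both models are built by exactly the same function and both know the token "if." The one that practiced on code expects an opening parenthesis next; the one that practiced on prose expects "he."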

All of which leads to an obvious question: how does that prediction actually work?