AI models can't read. Not really. Computers work with numbers, and language is made of words. Before a model can do anything with text, it has to figure out how to cross that gap.
What is a token, really?
There's a tendency to assume AI models work with words—reading them one at a time, the way you're reading this sentence. They don't. Instead, they break text into smaller pieces called tokens.
A token might be a whole word, part of a word, or even a single character. The word "understanding" could get split into "under" + "standing", or "un" + "derstand" + "ing". It depends on the model's vocabulary, the set of all tokens it knows.
Why not just use whole words? Because there are too many of them. English alone has hundreds of thousands, and new ones appear constantly ("ChatGPT", "deflategate"). Instead, models use a vocabulary of roughly 50,000–100,000 tokens, built from common sub-word patterns that can be combined to represent any text.
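To make the idea of combinable sub-word pieces concrete, here is a minimal sketch. The vocabulary is a hypothetical toy, and the splitting rule (always take the longest matching piece) is a simplification chosen for readability, not the method real tokenizers use.

```python
import string

# Hypothetical toy vocabulary: a few multi-character pieces plus every
# lowercase letter as a fallback, so any lowercase word can be encoded.
VOCAB = {"under", "stand", "ing", "un", "token", "iz"} | set(string.ascii_lowercase)

def split_into_tokens(word: str) -> list[str]:
    """Greedy longest-match split (a simplification, not real BPE)."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot encode {word[i]!r}")
    return tokens

print(split_into_tokens("understanding"))  # ['under', 'stand', 'ing']
print(split_into_tokens("tokenizing"))     # ['token', 'iz', 'ing']
print(split_into_tokens("zyx"))            # ['z', 'y', 'x']
```

The last call shows the fallback: a word the vocabulary has never seen still gets encoded, just with more and smaller pieces.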
See it for yourself
The explorer below uses the same tokenizer as GPT-3.5 and GPT-4 (called cl100k_base).
Type something and watch how it gets broken apart. Each token gets a numeric ID; hover over one to see it. Notice how common words stay whole while unusual ones get split.
Try typing something unusual—a technical term, a made-up word, something in another language. Watch the tokenizer break it into smaller pieces. Then try a common phrase like "the cat sat on the mat"—those everyday words each stay as single tokens.
How does it decide where to split?
If you've been playing with the tokenizer, you've noticed the splits can feel arbitrary. Why does one word break at a particular point and not another? Who decided "est" is a token but "esti" isn't?
The answer is an algorithm called Byte Pair Encoding (BPE), and it's surprisingly simple. BPE builds a vocabulary by starting with the smallest possible pieces (individual characters) and then repeatedly merging the most common pair into a new token. That's it. The key insight is that it learns from a massive pile of training text, so the merges reflect actual language patterns.
The algorithm:
- Start with characters. Every letter is its own token. The word "low" starts as "l" + "o" + "w".
- Count all adjacent pairs across the entire training text. Which pair of neighboring tokens appears most often?
- Merge the winner. That pair becomes a new token in the vocabulary.
- Repeat thousands of times, until the vocabulary reaches the desired size (typically 50,000–100,000 tokens).
Step through the algorithm below to watch it build a vocabulary. Pay attention to how common patterns merge first and how the total token count shrinks with each merge.
As you step through:
- Common patterns merge first. Frequent pairs like "l" + "o" get merged early because they show up in many words.
- The text gets shorter. Each merge reduces the total token count. That's compression—the vocabulary is learning efficient representations.
- Whole words can emerge. After enough merges, common words like "low" become single tokens. Rare words stay split.
- It's language-agnostic. BPE doesn't know English grammar. It finds statistical patterns, which is why the same procedure works for code, URLs, and other languages: it adapts to whatever text it trains on.
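The four steps of the algorithm can be sketched in a few lines of Python. This is a minimal illustration under some assumptions: the training "text" is a hypothetical list of words, pairs are only counted inside words, and ties go to whichever pair was seen first. Production tokenizers differ in many details.

```python
from collections import Counter

def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(words: list[str], num_merges: int):
    # Step 1: start with characters.
    corpus = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        # Step 2: count all adjacent pairs.
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Step 3: merge the winner (first-seen pair wins a tie).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
        # Step 4: repeat.
    return merges, corpus

# Hypothetical toy training text.
words = ["low"] * 5 + ["lot"] * 2 + ["lower"] * 2 + ["newest"] * 3
merges, corpus = train_bpe(words, num_merges=2)
print(merges)                            # [('l', 'o'), ('lo', 'w')]
print(corpus[0])                         # ['low']
print(sum(len(s) for s in corpus))       # 33 tokens, down from 49 characters
```

After only two merges the frequent word "low" is already a single token, while "newest" is still spelled out letter by letter.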
Hold on—isn't this just compression?
If you watched the token count drop as you stepped through the BPE visualizer, you might have had a thought: this is just compression. And you'd be right. BPE was originally invented by Philip Gage in 1994 as a data compression algorithm. The AI world borrowed it.
But if it's compression, why not use something better at it? Algorithms like Brotli or gzip produce much smaller output. The reason LLMs don't use them says a lot about what these models actually need.
Compression algorithms like Brotli optimize for size. They encode text using context-dependent references, essentially saying "this part is the same as something 4,000 bytes ago." The resulting "tokens" are opaque pointers that shift meaning depending on their surroundings. The same word might get encoded completely differently based on what came before it.
An LLM needs the opposite: a fixed, stable vocabulary where every piece has a permanent identity. The token for "ing" is always the same token, whether it appears in "running," "singing," or "tokenizing." That consistency matters because the model has to learn what each token means across billions of examples. If "ing" got a different representation every time, the model could never learn that it often signals a participle ("running"), a gerund ("swimming"), or any of the other patterns it appears in.
There's a practical constraint too. When the model generates text, it picks one token at a time from a fixed menu of ~100,000 options. Compression algorithms don't give you a fixed menu. They give you an ever-shifting set of references that only make sense within a specific stream of data.
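The back-reference behaviour described above can be observed with Python's built-in `zlib` module (the DEFLATE format also used by gzip). This is a small sketch with an arbitrary example sentence: the second copy of the sentence costs almost nothing because it is stored as a pointer to the first copy, not as a reusable unit with its own identity.

```python
import zlib

# Arbitrary example text; any sentence with little internal repetition works.
sentence = b"the quick brown fox jumps over the lazy dog. "

once = zlib.compress(sentence)
twice = zlib.compress(sentence * 2)

# Extra bytes needed to store the second, identical copy.
extra = len(twice) - len(once)

print(len(sentence), len(once), len(twice), extra)
# Decompression restores both copies exactly.
print(zlib.decompress(twice) == sentence * 2)  # True
```

The handful of extra bytes only make sense next to the data that came before them, which is exactly why they can't serve as a fixed menu of tokens.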
From tokens to numbers
Each token in the vocabulary has a unique numeric ID. When you type "The cat sat on the mat," the model doesn't see words. It sees a sequence of numbers like [791, 8415, 7731, 389, 279, 5634].
But a bare number isn't enough. The ID 8415 for "cat" is just an arbitrary label that doesn't carry any meaning. The model needs something richer, a representation that captures what a token actually means. This is where embeddings come in.
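The token-to-ID step is essentially a table lookup. The sketch below reuses the IDs quoted above; pairing each ID with a word in order and splitting on spaces are simplifying assumptions made for illustration, since a real tokenizer also handles spacing and punctuation.

```python
# Toy lookup table built from the IDs quoted in the text (pairing is assumed).
TOKEN_TO_ID = {"The": 791, "cat": 8415, "sat": 7731, "on": 389, "the": 279, "mat": 5634}
ID_TO_TOKEN = {token_id: token for token, token_id in TOKEN_TO_ID.items()}

def encode(text: str) -> list[int]:
    """Map each space-separated token to its numeric ID."""
    return [TOKEN_TO_ID[token] for token in text.split()]

def decode(ids: list[int]) -> str:
    """Map IDs back to tokens and rejoin them."""
    return " ".join(ID_TO_TOKEN[token_id] for token_id in ids)

ids = encode("The cat sat on the mat")
print(ids)          # [791, 8415, 7731, 389, 279, 5634]
print(decode(ids))  # The cat sat on the mat
```

Note that "The" and "the" get different IDs: to the lookup table they are simply two unrelated entries.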
Embeddings: meaning as position in space
An embedding is a list of numbers, often hundreds or thousands of them, that represents a token's meaning. Think of those numbers as coordinates in a vast, high-dimensional space where words with similar meanings end up near each other. Relationships between words become directions and distances.
The visualization below plots words in two dimensions, positioned by meaning. Real embeddings have hundreds or thousands of dimensions, so this is a simplified projection, but the relationships are real.
Notice:
- Words cluster by meaning. Animals end up near animals, food near food, royalty near royalty.
- Relationships are directions. Switch to "See vector arithmetic" and play the animation. The direction from "man" to "king" is roughly the same as the direction from "woman" to "queen": the model has learned a "royalty" direction in this space. (This relationship was first demonstrated in 2013, in the word2vec work by Mikolov and colleagues.)
- Nearness means similarity. Hover over "puppy"—its nearest neighbors are "dog" and "kitten," which makes intuitive sense.
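The three observations above can be reproduced with a handful of hand-made vectors. Everything in this sketch is hypothetical: the four dimensions (roughly "male", "female", "royal", "animal") and the values were chosen by hand so the geometry is easy to follow, whereas real embeddings are learned and their dimensions have no such labels.

```python
import math

# Hand-crafted toy embeddings (hypothetical values, not learned).
EMBEDDINGS = {
    "man":    [1.0, 0.0, 0.0, 0.0],
    "woman":  [0.0, 1.0, 0.0, 0.0],
    "king":   [1.0, 0.0, 1.0, 0.0],
    "queen":  [0.0, 1.0, 1.0, 0.0],
    "dog":    [0.1, 0.0, 0.0, 1.0],
    "puppy":  [0.0, 0.1, 0.0, 0.9],
    "kitten": [0.0, 0.2, 0.0, 0.8],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(vector: list[float], exclude: set[str], k: int = 1) -> list[str]:
    """Return the k words whose embeddings point most nearly the same way."""
    candidates = [word for word in EMBEDDINGS if word not in exclude]
    candidates.sort(key=lambda word: cosine(vector, EMBEDDINGS[word]), reverse=True)
    return candidates[:k]

# king - man + woman lands on queen.
target = [k - m + w for k, m, w in
          zip(EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])]
print(nearest(target, exclude={"king", "man", "woman"}))       # ['queen']

# The nearest neighbors of "puppy" are the other animals.
print(nearest(EMBEDDINGS["puppy"], exclude={"puppy"}, k=2))    # ['kitten', 'dog']
```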
Nobody told the model that "cat" and "dog" are both animals. These clusters emerge from how embeddings are learned. During training, the model sees "cat" and "dog" in similar contexts—"The ___ sat on the couch," "She fed the ___," "The ___ was sleeping." Because they substitute for each other in so many sentences, the training process nudges their embeddings closer together. Words that appear in similar contexts end up with similar coordinates.
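The "similar contexts" idea can be illustrated by simply counting contexts. This sketch is an assumption-laden stand-in, not how a model is actually trained: it describes each word by the slots it fills (the word before and the word after) in a tiny hypothetical corpus, then compares those counts.

```python
import math
from collections import Counter

# Tiny hypothetical corpus.
SENTENCES = [
    "the cat sat on the couch",
    "the dog sat on the couch",
    "she fed the cat",
    "she fed the dog",
    "the king ruled the kingdom",
    "the queen ruled the kingdom",
]

def context_counts(target: str) -> Counter:
    """Count the (previous word, next word) slots that `target` fills."""
    counts = Counter()
    for sentence in SENTENCES:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(words) - 1):
            if words[i] == target:
                counts[(words[i - 1], words[i + 1])] += 1
    return counts

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm

cat, dog, king = context_counts("cat"), context_counts("dog"), context_counts("king")
print(cosine(cat, dog))   # 1.0: identical contexts
print(cosine(cat, king))  # 0.0: no shared contexts
```

Training a real model replaces the counting with gradual nudges to each embedding, but the signal it draws on is the same: which words can stand in for each other.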
"King" and "queen" cluster together for the same reason—they share contexts like "The ___ ruled the kingdom." But "king" and "cat" rarely substitute for each other, so they end up far apart. We look at the clusters and say "those are the animals," but the model just knows they're words that show up in similar sentences.
The full pipeline: raw text gets split into tokens, each token maps to a numeric ID, and each ID gets converted into an embedding vector. What comes out the other end is a sequence of numerical representations the model can actually work with.
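As a rough end-to-end sketch of that pipeline, the snippet below uses a hypothetical five-token vocabulary, made-up IDs, and random 4-dimensional vectors in place of learned embeddings. Splitting on spaces stands in for a real tokenizer.

```python
import random

# Hypothetical toy vocabulary; a token's ID is its position in this list.
VOCAB = ["the", "cat", "sat", "on", "mat"]
TOKEN_TO_ID = {token: token_id for token_id, token in enumerate(VOCAB)}

# One row per token ID. Random values stand in for learned embeddings.
EMBEDDING_DIM = 4
rng = random.Random(0)
EMBEDDING_TABLE = [[rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)] for _ in VOCAB]

def text_to_vectors(text: str):
    tokens = text.split()                         # 1. text -> tokens
    ids = [TOKEN_TO_ID[t] for t in tokens]        # 2. tokens -> IDs
    vectors = [EMBEDDING_TABLE[i] for i in ids]   # 3. IDs -> embedding vectors
    return tokens, ids, vectors

tokens, ids, vectors = text_to_vectors("the cat sat on the mat")
print(ids)                       # [0, 1, 2, 3, 0, 4]
print(len(vectors), len(vectors[0]))  # 6 4
print(vectors[0] == vectors[4])  # True: both occurrences of "the" share one vector
```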
Why this matters in practice
Tokenization explains a surprising number of AI quirks:
- Why AI struggles with counting letters: It doesn't see letters, it sees token chunks. Ask it how many "r"s are in "strawberry" and it has to work with "str" + "aw" + "berry"—the individual letters are hidden inside the tokens.
- Why longer prompts cost more: Most hosted AI models are priced per token. More tokens means more computation and higher cost. Concise prompts are literally cheaper.
- Why there are "context limits": Models can only handle a fixed number of tokens at once. We'll dig into this in Chapter 3.
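Each of these quirks comes down to simple arithmetic over tokens. In the sketch below, the price and the context limit are hypothetical placeholders chosen for illustration, not real figures for any model. The split of "strawberry" is the one quoted above.

```python
# Hypothetical numbers for illustration only; real prices and limits vary by model.
PRICE_PER_MILLION_TOKENS_USD = 2.0
CONTEXT_LIMIT_TOKENS = 4096

def estimate_cost_usd(num_tokens: int) -> float:
    """Cost grows linearly with the number of tokens."""
    return num_tokens * PRICE_PER_MILLION_TOKENS_USD / 1_000_000

def fits_in_context(num_tokens: int) -> bool:
    """A prompt is usable only if its token count is within the limit."""
    return num_tokens <= CONTEXT_LIMIT_TOKENS

# The split of "strawberry" quoted in the text.
tokens = ["str", "aw", "berry"]
print(len(tokens))                        # 3: the units the model works with
print(sum(t.count("r") for t in tokens))  # 3: the letters, visible only inside the tokens

print(estimate_cost_usd(1500))            # 0.003
print(fits_in_context(5000))              # False
```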
What the training data shapes
One question that comes up: if two models use the same tokenizer, why is one better at coding and another better at creative writing? Because tokenization is just the first step. The training data shapes everything that comes after.
Start with embeddings. A model trained on millions of lines of code will learn that "function", "return", and "if" cluster together, and that "{" and "}" have a tight relationship. A model trained mostly on novels develops very different clusters: "whispered" and "murmured" end up near each other, but it won't have strong embeddings for programming constructs it rarely encountered.
But embeddings are just the starting representation. The real "knowledge" about how code or language works gets encoded in the billions of parameters in the layers above, the attention patterns and prediction weights we'll get into in upcoming chapters.
A code-heavy training set means those layers get tuned through millions of code prediction exercises: predicting the next token in Python functions, JavaScript callbacks, and SQL queries. The model doesn't just learn that "def" and "function" are related tokens. It learns the patterns of how code flows.
This is why training data matters so much and why companies invest enormous resources curating it. The same architecture, trained on different text, produces a fundamentally different model. The tokens and embeddings are the shared vocabulary. What the model learns to do with them depends entirely on what it practiced predicting.
All of which leads to an obvious question: how does that prediction actually work?