In Chapter 1, we turned text into numbers—tokens mapped to IDs, IDs mapped to embeddings, each word becoming a point in a vast numerical space. But we skipped over something important: where do those embeddings come from? At the start of training, they're completely random. "Cat" and "dog" are no closer together than "cat" and "spreadsheet." The model knows absolutely nothing.
How does it go from random noise to writing poetry and explaining quantum physics?
Fill in the blank
The core idea behind training an LLM is almost absurdly simple: show it a sequence of tokens, hide the last one, and ask it to guess what comes next.
That's it. The entire training process is a massive game of fill-in-the-blank, repeated billions of times across terabytes of text. The model reads "The capital of France is ___" and tries to predict "Paris."
When it's wrong, it nudges its numbers—the embeddings, the attention weights, every other parameter it has—slightly in a direction that would have made the right answer more likely. Then it does it again with the next piece of text.
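To make the fill-in-the-blank setup concrete, here is a minimal sketch of how one sentence can be cut into guessing exercises. The function name `next_token_examples` is made up for illustration, and splitting on whitespace is only a stand-in for the real tokenizer from Chapter 1. In practice a passage usually yields one example per position, not just one for its final token.

```python
def next_token_examples(tokens):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Whitespace splitting is an assumption for illustration, not a real tokenizer.
tokens = "The capital of France is Paris".split()
examples = next_token_examples(tokens)

for context, target in examples:
    print(" ".join(context), "->", target)
```

The last pair is the one from the text: the context "The capital of France is" with the hidden target "Paris."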
Your turn: be the model
Try it yourself. For each prompt below, guess what word comes next. Then see what the model would predict, and how confident it would be.
Your intuition works a lot like the model does. For famous phrases ("To be or not to ___"), you're almost certain of the answer. For open-ended sentences ("She opened the door and ___"), many options feel equally plausible. That gap between certainty and ambiguity is exactly what the model's probability distribution captures.
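One way to put a number on that gap is entropy, which measures how spread out a probability distribution is. The sketch below is only an illustration: the probabilities are invented, not taken from a real model, and entropy is our choice of measure here rather than something the model itself computes while predicting.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: 0 means fully certain, higher means more spread out."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# Invented probabilities for illustration only.
famous = {"be": 0.97, "do": 0.01, "go": 0.01, "see": 0.01}               # "To be or not to ___"
open_ended = {"smiled": 0.25, "screamed": 0.25, "left": 0.25, "froze": 0.25}  # "She opened the door and ___"

print("famous phrase:", round(entropy_bits(famous), 2), "bits")
print("open-ended:   ", round(entropy_bits(open_ended), 2), "bits")
```

The famous phrase scores close to zero because almost all the probability sits on one word. The open-ended sentence scores higher because four options share it equally.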
What are "parameters," exactly?
When you hear that a model has "70 billion parameters," it sounds impressive but vague. Each parameter is a single number: a decimal like 0.0023 or -1.4072. Together, these billions of numbers encode everything the model has learned.
But they're not all doing the same job. They fall into several distinct groups, and you already know one of them.
Embedding values. Remember the embedding space from Chapter 1—words as points in a high-dimensional space? Those coordinates are parameters. At the start of training, every token sits at a random position. As training progresses, the model gradually nudges them: "cat" drifts toward "dog" because they keep appearing in similar contexts. "King" drifts toward "queen." The clusters you explored in Chapter 1 didn't exist at the start—they emerged from billions of tiny adjustments.
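What does "closer" mean for two points in embedding space? One common measure is cosine similarity, which compares the directions of two vectors. The text doesn't commit to a particular measure, so treat that choice as an assumption. The sketch below uses made-up three-dimensional vectors to show the kind of change training produces. Real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return a value from -1 (opposite directions) to 1 (same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented coordinates for illustration: "before" mimics random starting
# positions, "after" mimics positions once training has pulled them together.
before = {"cat": [0.9, -0.2, 0.1], "dog": [-0.3, 0.8, 0.5]}
after = {"cat": [0.7, 0.6, 0.2], "dog": [0.6, 0.7, 0.3]}

sim_before = cosine_similarity(before["cat"], before["dog"])
sim_after = cosine_similarity(after["cat"], after["dog"])
print(f"before training: {sim_before:.2f}, after training: {sim_after:.2f}")
```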
Attention weights. These control how each token decides what other tokens to "look at" when making a prediction. We'll dig into attention properly in Chapter 3, but the short version is that these parameters give the model its ability to understand context—to figure out which surrounding words matter most for predicting what comes next.
Feed-forward network weights. After each attention layer sits a feed-forward neural network layer that transforms what the model gathered from attention into something useful. Think of attention as "gathering relevant information" and the feed-forward layers as "making sense of it." In typical transformer models, these parameters make up the bulk of the model's size.
The prediction layer. At the very end, a final set of parameters converts the model's internal representation into a probability for each of the ~100,000 tokens in its vocabulary. This is where the model says "I think there's a 92% chance the next token is 'Paris'."
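In most LLMs, the last step of that conversion is a function called softmax, which turns a raw score for each token into probabilities that are positive and sum to 1. Here is a minimal sketch with four candidate tokens instead of ~100,000. The scores are hypothetical, chosen only to show the shape of the calculation.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores.values())  # subtracting the max avoids overflow in exp
    exps = {token: math.exp(s - highest) for token, s in scores.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}

# Hypothetical raw scores for "The capital of France is ___".
scores = {"Paris": 6.0, "Lyon": 3.0, "London": 2.5, "the": 1.0}
probs = softmax(scores)

for token, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{token:>7}: {p:.1%}")
```

Because softmax exponentiates the scores, a modest lead in raw score becomes a large lead in probability.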
All of these—embeddings, attention, feed-forward, prediction—get adjusted simultaneously during training. No single parameter stores a "fact." Knowledge is distributed across millions of parameters working together, across all the layers.
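To get a feel for how the parameter budget splits across these four groups, here is a rough back-of-the-envelope count. Everything in it is an assumption for illustration: the sizes are hypothetical, the formulas describe a plain transformer layout, and small pieces such as biases and normalization layers are ignored. Real models differ in the details.

```python
def count_parameters(vocab_size, d_model, n_layers, ff_multiplier=4):
    """Rough parameter counts per group, ignoring biases and normalization layers."""
    d_ff = ff_multiplier * d_model  # width of the feed-forward hidden layer
    return {
        "embeddings": vocab_size * d_model,
        "attention": n_layers * 4 * d_model * d_model,  # query, key, value, output matrices
        "feed_forward": n_layers * 2 * d_model * d_ff,  # expand, then project back
        "prediction": vocab_size * d_model,             # assumes a separate output matrix
    }

# Hypothetical sizes, not the configuration of any particular model.
counts = count_parameters(vocab_size=100_000, d_model=4096, n_layers=32)
total = sum(counts.values())

for group, n in counts.items():
    print(f"{group:>12}: {n:>13,} ({n / total:.0%})")
print(f"{'total':>12}: {total:>13,}")
```

With these assumed sizes the total comes to about 7 billion, and the feed-forward layers are the largest group.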
Watching it flow through the layers
Step through a single prediction below and watch the data transform as it passes through each layer, from raw embeddings to a probability distribution.
Notice how the numbers change at every layer. The embeddings are just the starting point—by the time the data reaches the prediction layer, it's been transformed through multiple rounds of attention and feed-forward processing. It's all of those parameters working together that produce the final prediction.
Training: the long, slow adjustment
Training works in a loop:
- Show the model some text with the last token hidden.
- The text flows through the pipeline—tokenization, embeddings, attention layers, feed-forward layers—until the model produces a probability distribution (a list of how likely each possible next token is) over all ~100,000 tokens in its vocabulary.
- Compare the prediction to reality. The difference is the "loss"—a single number measuring how wrong the model was.
- Trace the error backward through every layer, calculating how much each parameter contributed to the mistake. (The technical term is backpropagation—literally propagating the blame backward through the network.)
- Nudge every parameter slightly in the direction that would have reduced the error. Think of it as finding the downhill slope and taking a small step: that's gradient descent. The "gradient" tells you which way the slope runs; the "descent" is the small step downhill.
- Repeat—billions of times, across trillions of tokens of text.
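The loop above can be sketched in a few lines of Python. This is a deliberately tiny stand-in, not how a real LLM is trained: the "model" is just three adjustable numbers, one score per token in a three-word vocabulary, and it only ever sees one context. With a single layer there is nothing to trace backward through, so the backpropagation step collapses into one line. The vocabulary, learning rate, and step count are all made up for illustration.

```python
import math

vocab = ["Paris", "London", "banana"]
logits = [0.0, 0.0, 0.0]        # the only parameters in this toy model
target = vocab.index("Paris")   # the hidden token we want it to predict
learning_rate = 0.5             # how big each nudge is

def softmax(xs):
    highest = max(xs)
    exps = [math.exp(x - highest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

losses = []
for step in range(50):
    probs = softmax(logits)              # predict: a probability for each token
    loss = -math.log(probs[target])      # compare: cross-entropy loss, one number
    losses.append(loss)
    # Blame: gradient of the loss for each score is (probability - 1) for the
    # correct token and (probability - 0) for every other token.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Nudge: move each parameter a small step against its gradient.
    logits = [w - learning_rate * g for w, g in zip(logits, grads)]

final_probs = softmax(logits)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print(f"P(Paris) after training: {final_probs[target]:.2f}")
```

The model starts out giving each token an equal one-in-three chance. Each pass raises the score for "Paris" a little and lowers the others, so the loss shrinks step by step.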
That "nudge every parameter" step is where embeddings drift closer for words that appear in similar contexts. It's where attention weights learn to connect "closed" back to "bank" across a sentence, and where feed-forward layers learn to tell the difference between a question and a statement.
Every single pass adjusts all the parameters a tiny amount. Over billions of passes, those tiny amounts accumulate into something that looks a lot like understanding.
The text comes from a massive training dataset: books, websites, code, Wikipedia, academic papers, forum posts—a large fraction of what humans have written and published online. The model sees each piece of text and tries to predict it, token by token, picking up patterns at every scale: spelling, grammar, facts about the world, reasoning strategies, writing styles.
A note on training data and ethics. That phrase "a large fraction of what humans have written" deserves scrutiny. Most training datasets include copyrighted books, news articles, and other creative work—often without the knowledge or consent of the authors. This has led to significant legal battles, with authors and publishers suing AI companies for using their work without permission or compensation.
These aren't frivolous claims: some court rulings have gone against AI companies, and more cases are ongoing. The models work as well as they do because they were trained on the creative output of millions of people who were never asked and never compensated. How the industry addresses this, and whether it meaningfully does, is one of the most important unresolved questions in AI.
What the model actually learns
Through this process, the model picks up more than you'd expect:
- Grammar: It learns that "She goes" is more likely than "She go" without anyone teaching it grammar rules.
- Facts: It learns that "The capital of France is Paris" because that pattern appears many times in the training data.
- Reasoning patterns: By predicting step-by-step solutions in math textbooks, it picks up how to mimic logical reasoning.
- Style and tone: It learns to predict formal text in academic contexts and casual text in forum contexts.
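To see how a preference like "She goes" over "She go" can fall out of data alone, here is a much simpler stand-in for an LLM: a model that only counts which word follows which. It has no neural network and looks at just one previous word, so it is an illustration of the principle rather than of the real mechanism. The function names and the four-sentence corpus are made up.

```python
from collections import Counter, defaultdict

def train_bigram_counts(sentences):
    """Count which word follows each word in the training sentences."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def next_word_probs(following, word):
    """Turn the raw counts for one word into probabilities."""
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# A tiny invented corpus; the last line is a deliberate error in the data.
corpus = [
    "She goes home",
    "She goes to work",
    "She goes outside",
    "She go home",
]
probs = next_word_probs(train_bigram_counts(corpus), "She")
print(probs)
```

Nobody told this model a grammar rule. It prefers "goes" after "She" only because that is what the data mostly contains, which is the same reason an LLM does, at a vastly larger scale and with far more context.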
But is it just memorizing?
There's a natural question here: is the model just memorizing its training data and regurgitating it? In reality, it does both—it memorizes and generalizes—and the distinction matters.
Memorization is when the model stores and reproduces specific sequences from its training data. Ask it to recite the opening of A Tale of Two Cities and it will give you "It was the best of times, it was the worst of times" essentially verbatim. It saw that exact sequence so many times that the prediction at each step is near-certain. The same goes for common code snippets, song lyrics, or famous speeches.
The model isn't "understanding" Dickens. It's reproducing a pattern it encountered repeatedly.
Generalization is different. It's when the model produces something it has never seen before by combining patterns it learned separately. Ask it to "write a haiku about database indexing" and it will produce something coherent, even though no such haiku existed in its training data. It learned haiku structure from poetry, it learned about database indexing from technical writing, and it can combine those patterns into something new.
If you memorize a recipe for chocolate cake, you can reproduce that exact cake. But if you've made hundreds of different cakes, you develop an intuition for how baking works—how flour and butter and sugar interact—and you can invent a recipe you've never seen. The first is memorization. The second is generalization.
LLMs do both, and the line between them is blurry. When the model writes a paragraph that sounds like a particular author, is it reproducing memorized phrases or generalizing from that author's style? Often it's a mix.
This tension—how much is rote recall versus genuine pattern combination—connects directly to the copyright concerns we mentioned earlier. If the model reproduces someone's writing verbatim, that feels clearly like a problem. If it learned an abstract pattern from millions of examples and applies it in a new context, the picture is murkier.
We've covered how a model goes from knowing nothing to encoding grammar, facts, and reasoning patterns in its parameters. But there's a piece we've been vague about: when the model is processing a sentence, how does it figure out which words matter to which other words? That mechanism is what we'll get into next.