The model was trained to do exactly one thing: predict the next token. And yet it writes essays, solves math problems, debugs code, and holds its own in philosophical debate. How does a prediction engine pull that off?
Prediction is more than it seems
"It just predicts the next word" sounds almost trivially simple. But think about what good prediction actually demands:
- To predict the next word in a math proof, you have to follow the logic of the proof.
- To predict the next word in a story, you have to understand characters, motivations, and narrative structure.
- To predict the next word of code, you have to understand programming syntax, logic, and the program's intent.
The training loop from Chapter 2 punishes the model every time it predicts badly. To reduce that punishment, the model has to build internal representations—encoded in those billions of attention and feed-forward parameters—that capture the structure of whatever it's predicting. It didn't set out to learn reasoning. But reasoning turned out to reduce prediction error, so the model developed something that looks a lot like it.
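To make the "punishment" concrete, here is a minimal sketch. It assumes the penalty is the standard cross-entropy loss, which is the usual choice for next-token training but is an assumption about what Chapter 2 describes. The function name and the example probabilities are made up for illustration.

```python
import math

def next_token_loss(predicted_probs, correct_token):
    """Cross-entropy penalty for one prediction (assumed loss, see lead-in).

    predicted_probs maps each candidate token to the probability the model
    assigned to it; the penalty is the negative log of the probability
    given to the token that actually came next.
    """
    return -math.log(predicted_probs[correct_token])

# Hypothetical predictions for the word after "The cat sat on the".
confident_right = {"mat": 0.90, "sofa": 0.08, "moon": 0.02}
confident_wrong = {"mat": 0.02, "sofa": 0.08, "moon": 0.90}

# The true next word is "mat": a good prediction is penalized lightly,
# a confidently wrong one is penalized heavily.
print(next_token_loss(confident_right, "mat"))
print(next_token_loss(confident_wrong, "mat"))
```

The only way to keep this number small across trillions of examples is to put high probability on what actually comes next, which is where the pressure to model the underlying structure comes from.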
From probabilities to text: sampling
When you chat with an AI, it doesn't just pick the single most likely next word every time. It samples from its probability distribution. The parameter that controls this is called temperature.
Temperature controls how sharp or flat that distribution is. Low temperature makes the model more deterministic—it almost always picks the top prediction. High temperature flattens things out, giving less likely words a real shot at being chosen.
With a temperature of 0.70, the model is balanced between predictable and creative. In this example, the top prediction gets ~62% of the probability mass, so the model will mostly stick to the most likely word but will sometimes surprise you with its second or third choice. The exact percentage depends on the underlying distribution, not just on the temperature.
This is why you get different answers when you ask the same question twice. The model isn't broken—it's sampling from a distribution, and different samples produce different text. Temperature is the dial between predictability and creativity.
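The effect of temperature can be sketched in a few lines. This is an illustration under stated assumptions, not any particular model's implementation: the candidate words and their raw scores (logits) are invented, and `softmax_with_temperature` and `sample_token` are hypothetical helper names. The one standard ingredient is dividing the scores by the temperature before applying softmax.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng=random):
    """Draw one token at random, weighted by its temperature-adjusted probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical candidates and scores for the next word.
tokens = ["mat", "sofa", "roof", "moon"]
logits = [2.0, 1.5, 0.5, -1.0]

for temperature in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 2) for p in probs])

# Sampling several times at the same temperature can give different words.
print([sample_token(tokens, logits, 0.7) for _ in range(5)])
```

Running the loop shows the top word's share shrinking as the temperature rises, and the last line shows why the same prompt can produce different continuations.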
Emergent abilities
As these models get larger, they don't just get incrementally better at the same tasks. They appear to gain new capabilities that smaller models can't touch:
- Multi-step reasoning: Breaking a complex problem into steps
- Analogical thinking: "X is to Y as A is to ___"
- Code generation: Writing working programs from descriptions
- Translation: Converting between languages, even ones with limited training data
This was initially described as emergence—abilities appearing suddenly at a specific scale, like water turning to ice at a critical temperature. Early research made it look like models couldn't do multi-step reasoning at all until they crossed some threshold, then could do it reliably.
In reality, the picture is more nuanced. Researchers have since shown that much of what looked like sudden emergence may be a measurement artifact. When you evaluate with sharp pass/fail criteria—the answer is exactly right or it's wrong—performance looks like a sudden jump. Use smoother metrics that give partial credit, and the improvement curve is often gradual. The ability was building incrementally; the measurement just made it look sudden.
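A toy calculation shows how a pass/fail metric can manufacture a jump. The sketch assumes, purely for illustration, that an answer is 10 tokens long and that each token is correct independently with the same probability. Real models are not that simple, and `exact_match_rate` is a hypothetical name.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every token of an answer is right, assuming independence."""
    return per_token_accuracy ** answer_length

# Per-token accuracy improves smoothly as the (imaginary) model scales up,
# but the all-or-nothing score stays near zero and then climbs steeply.
for accuracy in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
    score = exact_match_rate(accuracy, answer_length=10)
    print(f"per-token {accuracy:.2f} -> exact-match {score:.3f}")
```

The partial-credit metric (per-token accuracy) rises steadily, while the exact-match column looks like an ability switching on.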
That said, the practical difference is real. There are things a model with 100 billion parameters can do that a 1 billion parameter model simply cannot, regardless of how you measure it. Whether you call that "emergence" or "the cumulative effect of scale," the result is the same: bigger models aren't just better. They're different in kind.
Chain-of-thought: thinking step by step
Give a model this problem: "A store has 3 shelves. Each shelf holds 4 boxes. Each box contains 12 items. If 15% of items are defective, how many non-defective items are in the store?" Without any special prompting, the model often jumps straight to an answer and gets it wrong. Too many steps to hold at once.
But ask it to "think step by step," and something interesting happens:
3 shelves × 4 boxes = 12 boxes
12 boxes × 12 items = 144 items total
144 × 0.15 = 21.6 ≈ 22 defective items
144 − 22 = 122 non-defective items
Why does this work? Because each word the model generates becomes part of its context for the next word. When it writes "12 boxes × 12 items = 144 items total," that intermediate result is now sitting in the context window, available for the next step.
Without chain-of-thought, the model has to leap from question to answer, doing all the intermediate math internally with nowhere to write it down. With it, the model gets a scratch pad: each step's output becomes the next step's input.
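The scratch-pad mechanism is just the ordinary generation loop. The sketch below is a toy, not a real model: `generate` and `toy_next_token` are hypothetical names, and the stand-in "model" simply doubles the last number it sees. What it illustrates is the one structural fact that matters here: every generated token is appended to the context before the next one is produced.

```python
def generate(next_token, prompt_tokens, max_new_tokens, stop_token="<eos>"):
    """Generate tokens one at a time, feeding each one back in as context."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(context)  # the model only ever sees the context
        if token == stop_token:
            break
        context.append(token)        # the output becomes part of the input
    return context

def toy_next_token(context):
    """Stand-in for a model: double the last number, stop once past 100."""
    doubled = int(context[-1]) * 2
    return str(doubled) if doubled <= 100 else "<eos>"

print(generate(toy_next_token, ["3"], max_new_tokens=10))
# → ['3', '6', '12', '24', '48', '96']
```

Each step here needs only the previous result, which it reads straight off the context. A written-out reasoning chain gives a real model the same convenience.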
RLHF: learning from human feedback
Raw prediction training produces a model that's good at continuing text—but continuing text isn't the same as being helpful. A prediction-only model might follow "How do I pick a lock?" with a detailed tutorial, because that's what would naturally come next in its training data.
Reinforcement Learning from Human Feedback (RLHF) is a second training phase that aligns the model with what humans actually want. The basic process: human raters look at pairs of model outputs for the same prompt and pick which response is better. Those preferences are used to train a separate "reward model" that can score any output on how well it matches what humans prefer. That reward model then guides the main model's further training—nudging its parameters toward outputs that score higher.
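One common way to turn "rater preferred A over B" into a training signal for the reward model is a pairwise loss on the two scores. The sketch below uses that formulation as an assumption; the text above does not say which loss is used, and `preference_loss` is a hypothetical name. The scores stand in for whatever the reward model outputs for each response.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise loss: -log(sigmoid(score_chosen - score_rejected)).

    Small when the reward model scores the human-preferred response higher,
    large when it ranks the pair the wrong way round.
    """
    margin = score_chosen - score_rejected
    # Two algebraically equal forms, chosen to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward-model scores for two responses to the same prompt.
print(preference_loss(2.0, 0.0))  # agrees with the rater: low loss
print(preference_loss(0.0, 0.0))  # no opinion: moderate loss
print(preference_loss(0.0, 2.0))  # disagrees with the rater: high loss
```

Minimizing this loss over many rated pairs pushes the reward model's scores to agree with human choices, and those scores are what later guide the main model's training.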
The result is a model that's more helpful and better at following instructions. This is why ChatGPT feels different from the raw model underneath it. The base capabilities come from prediction training. The conversational behavior and safety guardrails come from RLHF.
There's a lot more to how models get refined after initial training—fine-tuning, constitutional AI, direct preference optimization—but those are topics for another time. RLHF is a separate phase, layered on top of the prediction training we covered in Chapter 2, and it's what turns a text-completion engine into something that feels like an assistant.
What we've learned
- Words become numbers—text is tokenized into sub-word pieces, each mapped to a numerical representation.
- The model learns by predicting—billions of parameters are tuned through trillions of prediction exercises, learning grammar, facts, and reasoning patterns along the way.
- Attention connects everything—the transformer architecture lets every token consider every other token, building layered, contextual understanding.
- Prediction enables reasoning—the pressure to predict well forces the model to develop internal representations that look a lot like understanding, and techniques like chain-of-thought extend its capabilities further.
None of this involves consciousness or general intelligence. These are statistical engines that have learned extraordinarily rich patterns from human text. But understanding how they work changes the way you interact with them. You can reason about why a model struggles with a particular task, why a longer prompt might help, why "think step by step" makes a difference, and why the same question asked twice can get two different answers.