Demystifying AI
The Fundamentals · Chapter 4

From Prediction to Reasoning

The model was trained to do exactly one thing: predict the next token. And yet it writes essays, solves math problems, debugs code, and holds its own in philosophical debate. How does a prediction engine pull that off?

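Prediction is more than it seems

"It just predicts the next word" sounds almost trivially simple. But think about what good prediction actually demands. To guess the next word in arbitrary text, a model has to track grammar, logic, facts, narrative, code conventions, and style.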

The training loop from Chapter 2 punishes the model every time it predicts badly. To reduce that punishment, the model has to build internal representations, encoded in those billions of attention and feed-forward parameters, that capture the structure of whatever it's predicting. It didn't set out to learn reasoning. But reasoning turned out to reduce prediction error, so the model developed something that looks a lot like it.

[Figure: what it looks like ("predict next word") versus what it had to learn (grammar, logic, facts, narrative, code, style, reasoning).]
Key idea: Prediction forces understanding. To predict what comes next in complex text, the model builds internal models of how language, logic, and the world work. Those internal models are what we experience as the AI's "intelligence."
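
As a rough illustration of the "punishment" for bad predictions, next-token training is commonly scored with cross-entropy loss: the negative log of the probability the model gave to the token that actually came next. The sketch below assumes that loss (this chapter does not name one), and the tiny vocabulary and probabilities are made up for illustration.

```python
import math

def next_token_loss(predicted_probs: dict[str, float], actual_token: str) -> float:
    """Cross-entropy for one prediction: -log(probability given to the true token)."""
    return -math.log(predicted_probs[actual_token])

# Hypothetical predictions for the word after "The cat sat on the"
confident_and_right = {"mat": 0.90, "dog": 0.05, "moon": 0.05}
unsure = {"mat": 0.40, "dog": 0.30, "moon": 0.30}
confident_and_wrong = {"mat": 0.05, "dog": 0.90, "moon": 0.05}

cases = [
    ("confident and right", confident_and_right),
    ("unsure", unsure),
    ("confident and wrong", confident_and_wrong),
]
for name, probs in cases:
    # The true next word in this toy example is "mat"
    print(f"{name}: loss = {next_token_loss(probs, 'mat'):.2f}")
```

The loss is smallest when the model puts most of its probability on the right word and grows quickly when it is confidently wrong, which is the pressure that pushes the parameters toward better internal representations.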

From probabilities to text: sampling

When you chat with an AI, it doesn't just pick the single most likely next word every time. It samples from its probability distribution. The parameter that controls this is called temperature.

Temperature controls how sharp or flat that distribution is. Low temperature makes the model more deterministic—it almost always picks the top prediction. High temperature flattens things out, giving less likely words a real shot at being chosen.
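
One common way to implement this, sketched here under assumptions, is to divide the model's raw scores (logits) by the temperature before converting them to probabilities with softmax. The function name and the three logits are hypothetical, so the numbers it prints will not match any particular model or demo.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, dividing by the temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate next words
logits = [2.0, 1.0, 0.5]

for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t}: {[round(p, 2) for p in probs]}")
```

At low temperature nearly all the probability lands on the top-scoring word; at high temperature the three probabilities move closer together.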

[Figure: the temperature dial. Low temperature: predictable, always picks the top word. High temperature: creative or chaotic, any word has a chance.]
[Interactive demo: Temperature & Sampling, with the temperature set to 0.70.]

With a temperature of 0.70, the model is balanced between predictable and creative. In this example, the top prediction gets ~62% of the probability mass, so the model will sometimes surprise you with its second or third choice but will mostly stick to the most likely word.


This is why you get different answers when you ask the same question twice. The model isn't broken—it's sampling from a distribution, and different samples produce different text. Temperature is the dial between predictability and creativity.
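
A minimal sketch of that sampling step, using Python's standard `random.choices`. The three words and their probabilities are hypothetical, and the seed is there only so the run is repeatable.

```python
import random

def sample_word(words: list[str], probs: list[float], rng: random.Random) -> str:
    """Draw one word at random, weighted by its probability."""
    return rng.choices(words, weights=probs, k=1)[0]

# Hypothetical distribution over the next word
words = ["mat", "sofa", "roof"]
probs = [0.62, 0.25, 0.13]

rng = random.Random(0)  # seeded so this sketch gives the same draws each run
draws = [sample_word(words, probs, rng) for _ in range(1000)]
counts = {w: draws.count(w) for w in words}
print(counts)
```

Most draws pick the top word, but not all of them. Chain many such draws together and two runs of the same prompt drift apart.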

Emergent abilities

As these models get larger, they don't just get incrementally better at the same tasks. They appear to gain new capabilities that smaller models can't touch.

This was initially described as emergence—abilities appearing suddenly at a specific scale, like water turning to ice at a critical temperature. Early research made it look like models couldn't do multi-step reasoning at all until they crossed some threshold, then could do it reliably.

In reality, the picture is more nuanced. Researchers have since shown that much of what looked like sudden emergence may be a measurement artifact. When you evaluate with sharp pass/fail criteria—the answer is exactly right or it's wrong—performance looks like a sudden jump. Use smoother metrics that give partial credit, and the improvement curve is often gradual. The ability was building incrementally; the measurement just made it look sudden.
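
A toy calculation shows how a strict metric can turn steady progress into an apparent jump. It rests on a simplifying assumption that is not from the research itself: an answer is a fixed number of tokens, and each token is right independently with the same probability. All numbers are hypothetical.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Chance that every token of an answer is right, assuming independent tokens."""
    return per_token_accuracy ** answer_length

ANSWER_LENGTH = 10  # hypothetical: the answer is 10 tokens long

# Hypothetical models whose per-token accuracy improves steadily with scale
for accuracy in (0.50, 0.60, 0.70, 0.80, 0.90, 0.99):
    strict = exact_match_rate(accuracy, ANSWER_LENGTH)
    print(f"per-token accuracy {accuracy:.2f} -> exact match {strict:.3f}")
```

The partial-credit score (per-token accuracy) climbs smoothly, while the all-or-nothing score stays near zero for most of the range and then rises steeply near the end.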

That said, the practical difference is real. There are things a 100-billion-parameter model can do that a 1-billion-parameter model simply cannot, regardless of how you measure it. Whether you call that "emergence" or "the cumulative effect of scale," the result is the same: bigger models aren't just better. They're different in kind.

Chain-of-thought: thinking step by step

Give a model this problem: "A store has 3 shelves. Each shelf holds 4 boxes. Each box contains 12 items. If 15% of items are defective, how many non-defective items are in the store?" Without any special prompting, the model often jumps straight to an answer and gets it wrong. Too many steps to hold at once.

But ask it to "think step by step," and something interesting happens:

Without step-by-step: "There are 123 non-defective items." (often wrong — too many steps to juggle at once)
With step-by-step:
"Let me work through this:
3 shelves × 4 boxes = 12 boxes
12 boxes × 12 items = 144 items total
144 × 0.15 = 21.6 ≈ 22 defective items
144 − 22 = 122 non-defective items" (reliably correct)
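
The arithmetic in these steps can be checked directly. The sketch below uses exact fractions to avoid floating-point noise; the variable names are illustrative. Because 15% of 144 is not a whole number, the final whole-item count depends on how the fraction is rounded.

```python
from fractions import Fraction

shelves, boxes_per_shelf, items_per_box = 3, 4, 12
defect_rate = Fraction(15, 100)  # exactly 15%

boxes = shelves * boxes_per_shelf        # 12 boxes
total_items = boxes * items_per_box      # 144 items
defective = total_items * defect_rate    # 108/5, i.e. 21.6
non_defective = total_items - defective  # 612/5, i.e. 122.4

print(boxes, total_items, float(defective), float(non_defective))
```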

Why does this work? Because each word the model generates becomes part of its context for the next word. When it writes "12 boxes × 12 items = 144 items total," that intermediate result is now sitting in the context window, available for the next step.

Without chain-of-thought, the model has to leap from question to answer, doing all the intermediate math internally in a single pass. With it, the model gets a scratch pad: each step's output becomes the next step's input.

[Figure: the context window as a scratch pad. Each intermediate result written into the context becomes input for the next step.]
Key idea: Chain-of-thought reasoning works because the model's output becomes its own input. Each generated token extends the context window, letting the model build on its own intermediate results. It's not "thinking" internally—it's thinking out loud, and using that visible reasoning as context for the next step.
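
The loop behind this can be sketched in a few lines. Everything here is a toy: `generate`, `toy_next_token`, and the lookup table are hypothetical stand-ins for a real model, which would choose each token from a probability distribution over its whole context.

```python
from typing import Callable

def generate(prompt: list[str],
             next_token: Callable[[list[str]], str],
             max_new_tokens: int) -> list[str]:
    """Autoregressive loop: each new token is appended to the context,
    so it is visible when the following token is chosen."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == "<end>":
            break
        context.append(token)
    return context

# Hypothetical stand-in for a model: it looks only at the last token written.
TOY_TABLE = {"Q:": "step1", "step1": "step2", "step2": "answer", "answer": "<end>"}

def toy_next_token(context: list[str]) -> str:
    return TOY_TABLE.get(context[-1], "<end>")

print(generate(["Q:"], toy_next_token, max_new_tokens=10))
```

In this toy, "answer" is reachable only because "step1" and "step2" were written into the context first, which is the scratch-pad effect in miniature.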

RLHF: learning from human feedback

Raw prediction training produces a model that's good at continuing text—but continuing text isn't the same as being helpful. A prediction-only model might follow "How do I pick a lock?" with a detailed tutorial, because that's what would naturally come next in its training data.

Reinforcement Learning from Human Feedback (RLHF) is a second training phase that aligns the model with what humans actually want. The basic process: human raters look at pairs of model outputs for the same prompt and pick which response is better. Those preferences are used to train a separate "reward model" that can score any output on how well it matches what humans prefer. That reward model then guides the main model's further training—nudging its parameters toward outputs that score higher.
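
One common way to turn those pairwise picks into a training signal is a pairwise preference loss: the reward model is penalized when it scores the rejected response above the preferred one. The sketch below assumes that formulation, which this chapter does not specify. The function name and scores are illustrative.

```python
import math

def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """-log(sigmoid(score_preferred - score_rejected)).
    Small when the preferred response already scores higher."""
    margin = score_preferred - score_rejected
    return math.log(1.0 + math.exp(-margin))

# Hypothetical reward-model scores for two responses to the same prompt
score_a, score_b = 0.82, 0.34

agrees = preference_loss(score_a, score_b)     # the human rater preferred A
disagrees = preference_loss(score_b, score_a)  # what the loss would be had the rater preferred B
print(f"rater prefers A: {agrees:.2f}   rater prefers B: {disagrees:.2f}")
```

Training the reward model to lower this loss across many rated pairs pushes its scores to agree with the human rankings.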

[Figure: the RLHF loop. The model generates two answers (Response A and Response B); a human picks the better one; a reward model learns to score outputs ("how good is this?", for example A: 0.82, B: 0.34); the main model adjusts to score higher ("be more like A"), becoming more helpful and less harmful; the loop repeats with new outputs.]

The result is a model that's more helpful and better at following instructions. This is why ChatGPT feels different from the raw model underneath it. The base capabilities come from prediction training. The conversational behavior and safety guardrails come from RLHF.

There's a lot more to how models get refined after initial training—fine-tuning, constitutional AI, direct preference optimization—but those are topics for another time. RLHF is a separate phase, layered on top of the prediction training we covered in Chapter 2, and it's what turns a text-completion engine into something that feels like an assistant.

What we've learned

  1. Words become numbers—text is tokenized into sub-word pieces, each mapped to a numerical representation.
  2. The model learns by predicting—billions of parameters are tuned through trillions of prediction exercises, learning grammar, facts, and reasoning patterns along the way.
  3. Attention connects everything—the transformer architecture lets every token consider every other token, building layered, contextual understanding.
  4. Prediction enables reasoning—the pressure to predict well forces the model to develop internal representations that look a lot like understanding, and techniques like chain-of-thought extend its capabilities further.

None of this involves consciousness or general intelligence. These are statistical engines that have learned extraordinarily rich patterns from human text. But understanding how they work changes the way you interact with them. You can reason about why a model struggles with a particular task, why a longer prompt might help, why "think step by step" makes a difference, and why the same question can get a different answer the second time you ask.

This is just the beginning. We haven't covered fine-tuning, multi-modal models (vision + language), retrieval-augmented generation (RAG), agents, or the many open questions about what these models truly represent. But you now have the conceptual foundation to make sense of those topics, and to ask better questions about the ones we haven't covered yet.