Ch. 3: Attention & Context | Demystifying AI

The idea that made modern language models work is called attention. It lets a model weigh the relationships between all words in a passage at the same time. Before attention, models processed words in sequence and lost track of the beginning by the time they reached the end.

Words are meaningless alone

Think about the word "bank." Financial institution, or the edge of a river? The word alone doesn't tell you. You need the words around it. "I deposited money at the bank" versus "I sat on the river bank."

Earlier language models, such as recurrent neural networks, processed words one at a time, in order. It was like reading through a foggy window where you can only make out the current word and a fading memory of what came before. By the time you reached the end of a long sentence, the beginning had gone blurry.

Attention threw that limitation out. Instead of plodding through words in sequence, the transformer architecture lets every word look at every other word simultaneously and decide which connections matter most.

How attention actually works

Remember from Chapter 1 that each token enters the model as an embedding—a long list of numbers. Attention transforms that embedding by mixing in information from the other tokens. Each word creates a signal that says "what am I looking for," compares that against every other word's "what do I have to offer" signal, and blends in the winners' information. The mechanics:

The model has three sets of learned weights, and it uses them to create three new vectors for each token:

Query: "What am I looking for?" — computed by multiplying the token's embedding by a learned weight matrix.
Key: "What do I have to offer?" — computed by multiplying the same embedding by a different weight matrix.
Value: "What information do I carry?" — a third weight matrix, producing the actual content to pass along.

These are matrix multiplications—the same input goes in three times, multiplied by three different learned weight matrices, producing three different vectors. No magic. Just arithmetic.
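To make the "same input, three different matrices" idea concrete, here is a minimal sketch in plain Python. The 3-number embedding and the three weight matrices are made-up values chosen for easy arithmetic, and `project` is a hypothetical helper name. In a real model the matrices come out of training and are far larger.

```python
def project(embedding, weights):
    """Multiply a length-n embedding by an n x m weight matrix, giving a length-m vector."""
    columns = range(len(weights[0]))
    return [sum(x * row[j] for x, row in zip(embedding, weights)) for j in columns]

# Hypothetical 3-number embedding for one token (real embeddings are much longer).
embedding = [1.0, 0.0, 2.0]

# Made-up stand-ins for the three learned weight matrices (3 rows x 2 columns each).
W_query = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_key = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
W_value = [[2.0, 0.0], [0.0, 2.0], [0.0, -1.0]]

# The same embedding goes in three times and comes out as three different vectors.
query = project(embedding, W_query)   # [2.0, 1.0]
key = project(embedding, W_key)       # [2.0, 3.0]
value = project(embedding, W_value)   # [2.0, -2.0]

print(query, key, value)
```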

The clever part comes next. To figure out how much word A should attend to word B, the model takes A's Query and B's Key and computes their dot product—multiply the corresponding numbers and add them up. A high dot product means "A is looking for something B has." A low one means they're not relevant to each other.
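The dot product is simple enough to work by hand. The sketch below uses invented 3-number vectors for the Query of "closed" and the Keys of "bank" and "the". The numbers are illustrative assumptions, not values from any real model.

```python
def dot(a, b):
    """Multiply corresponding numbers and add them up."""
    return sum(x * y for x, y in zip(a, b))

# Invented vectors, for illustration only.
query_closed = [1.0, 2.0, 0.0]
key_bank = [2.0, 1.5, 1.0]
key_the = [-1.0, 0.0, 3.0]

score_bank = dot(query_closed, key_bank)  # 1*2 + 2*1.5 + 0*1 = 5.0
score_the = dot(query_closed, key_the)    # 1*(-1) + 2*0 + 0*3 = -1.0

print(score_bank, score_the)
```

With these numbers, "closed" scores "bank" much higher than "the", so "bank" would end up with the larger share of attention.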

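Those dot products are first divided by the square root of the number of entries in each Key vector, which keeps the numbers from growing too large. They are then run through a function called softmax, which turns them into attention weights: positive numbers that sum to 1. Each token's output is a weighted average of all the Value vectors, using those weights. If "closed" gives 40% attention to "bank" and 25% to "was," then 40% of "bank's" Value vector and 25% of "was's" Value vector get blended into the new representation for "closed," along with smaller shares from the remaining words.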
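Putting the pieces together, a single attention step for one token might be sketched like this. The vectors are invented, and `attend` is a hypothetical helper name. The scores are divided by the square root of the vector size before softmax, as in the standard Transformer formulation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (attention weights, weighted average of the Value vectors)."""
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]
    weights = softmax(scores)
    size = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(size)]
    return weights, blended

# Invented numbers: one Query compared against three tokens' Keys.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, blended = attend(query, keys, values)
print([round(w, 3) for w in weights], [round(b, 3) for b in blended])
```

The first Key lines up best with the Query, so the first Value vector dominates the blend.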

Those three weight matrices (Query, Key, and Value) are learned parameters. They start random and get adjusted during training, just like the embeddings and feed-forward weights from Chapter 2. Nobody programs in "look for the subject of the sentence." The model discovers that pattern on its own, because it helps predict the next word better.

Key idea: Attention is just matrix multiplication and dot products—but the weight matrices are learned, so the model discovers for itself what's worth paying attention to. "The bank by the river was closed"—when processing "closed," its Query happens to match strongly with "bank's" Key, so "bank's" Value gets pulled in. The model learned to make this connection because it helped with prediction during training.

Watch the matching happen

Pick a word below and watch the three-step process play out: see its Query vector, watch it get compared against every other word's Key via dot products, and see how those scores become attention weights. Try switching between attention heads to see how different heads focus on different relationships.

Notice how the three heads produce completely different attention patterns for the same word. The Grammar head might connect "sat" to "cat" (verb to subject), while the Meaning head connects "mat" to "on" (object to spatial relationship), and the Position head favors nearby words regardless of meaning. The model runs all these heads at once and combines their results—that's what "multi-head attention" means.

Attention in action

Click any word in the sentence below to see its attention pattern. Thicker lines mean stronger attention—the model is pulling more information from that word. Then try reducing the context window with the slider and watch connections disappear.

As you explore:

"river" attends strongly to "bank"—this is disambiguation in action. The word "river" provides the context that tells the model we're talking about a physical bank, not a financial one.
"closed" attends to "bank"—it needs to know what was closed. Several words sit between them, but attention bridges the gap directly.
Reducing the context window can break these connections. If the model can only see 4 words back when processing "closed," it loses the connection to "bank" entirely. This is why context window size matters.
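The effect of the slider can be sketched in a few lines. This assumes that "4 words back" means the current word plus the four words before it. `visible_words` is a hypothetical helper, and the sentence is the example used earlier in the chapter.

```python
def visible_words(words, position, window):
    """Words the token at `position` can attend to: itself plus up to `window` earlier words."""
    start = max(0, position - window)
    return words[start:position + 1]

sentence = ["The", "bank", "by", "the", "river", "was", "closed"]
closed = sentence.index("closed")

narrow = visible_words(sentence, closed, window=4)
wide = visible_words(sentence, closed, window=10)

print(narrow)  # ['by', 'the', 'river', 'was', 'closed']
print(wide)    # the full sentence
```

With the narrow window, "bank" is simply not among the candidates, so no amount of attention can recover it.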

Why can it only look backward?

You might have noticed something in the visualizer: each word can only attend to words that came before it, never words after. "closed" can look back at "bank," but "bank" can't look ahead at "closed." That seems like a limitation—wouldn't the model be better off seeing the full picture?

Models like GPT and Claude generate text one token at a time, left to right. When the model is deciding what comes after "The bank by the river was," the words "closed for the holiday" don't exist yet. There's nothing forward to look at. The model is creating those words. Looking backward is the only option.

This is called causal attention, and models that generate text this way are called autoregressive. It is the approach used by GPT, Claude, and most modern chatbots and text generators. "Causal" means each position can be influenced only by what came before it, never by what comes after.
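One way to picture causal attention in code: when turning scores into weights, positions after the current one are left out entirely, so they receive a weight of exactly zero. The sketch below uses invented scores, and `causal_softmax` is a hypothetical helper name. Real implementations usually get the same result by setting the hidden scores to negative infinity before softmax.

```python
import math

def causal_softmax(scores, position):
    """Softmax over scores[0..position] only; every later position gets weight 0."""
    allowed = scores[:position + 1]
    peak = max(allowed)
    exps = [math.exp(s - peak) for s in allowed]
    total = sum(exps)
    hidden = len(scores) - position - 1
    return [e / total for e in exps] + [0.0] * hidden

# Invented raw scores for one Query against four positions.
scores = [2.0, 1.0, 3.0, 0.5]

first = causal_softmax(scores, position=0)   # [1.0, 0.0, 0.0, 0.0]
second = causal_softmax(scores, position=1)  # weight only on positions 0 and 1

print(first)
print([round(w, 3) for w in second])
```

Note that position 2 has the highest raw score, yet the token at position 1 gives it no attention at all, because it lies in the future.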

But not all models work this way. BERT, a model Google built for understanding text (not generating it), uses bidirectional attention—every word can see every other word, forward and backward. That makes sense for BERT's job. When you're classifying a sentence, analyzing sentiment, or powering a search engine, the entire text already exists. There's no "future" to hide. BERT can read the whole thing at once and use every word to understand every other word.

The tradeoff: bidirectional models can use both the left and right context of every word, which suits understanding tasks, but they aren't built to generate new text token by token. Causal models can generate, but they always work with incomplete information, writing the next word without knowing what's coming.

Does a chatbot use both?

There's a tendency to assume a chatbot would use bidirectional attention to understand your question and then switch to causal attention to generate its reply. Reasonable intuition, but that's not how GPT, Claude, and most modern chatbots actually work.

These models use causal attention for everything—understanding included. The entire conversation (your message and the model's reply) is treated as one continuous sequence of tokens, processed left to right. There's no "understanding phase" followed by a "generation phase."

When you type "What is the capital of France?", the model processes each word causally—"What" can only see itself, "is" sees "What is", "the" sees "What is the," and so on. By the time it reaches the question mark, the last token's representation has gathered context from your entire question through many layers of attention. Then it keeps going, generating the answer one token at a time.

The model "understands" your question as a side effect of processing it left to right. The same causal procedure runs from the first token of your question to the last token of the reply, with no mode switching.
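The "one continuous sequence" idea can be sketched as a loop. Everything here is a toy: `toy_predict_next` is a hypothetical stand-in that looks up the last token in a tiny hand-written table, where a real model would run attention over the whole sequence. The point is only that the prompt and the reply live in the same list and the same step is repeated.

```python
# Hand-written table standing in for a trained model (purely illustrative).
TOY_NEXT_TOKEN = {"?": "Paris", "Paris": "."}
END = "."

def toy_predict_next(tokens):
    """Pretend model: a real one would attend over every token in `tokens`."""
    return TOY_NEXT_TOKEN.get(tokens[-1], END)

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)  # the prompt and the reply share one sequence
    for _ in range(max_new_tokens):
        next_token = toy_predict_next(tokens)
        tokens.append(next_token)  # the new token becomes context for the next step
        if next_token == END:
            break
    return tokens

prompt = ["What", "is", "the", "capital", "of", "France", "?"]
result = generate(prompt)
print(result)
```

There is no separate "read the question" function and "write the answer" function. The loop just keeps extending the same sequence.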

Some architectures do mix both approaches. These are called encoder-decoder models (like Google's T5, or the original Transformer from the famous "Attention Is All You Need" paper). They use bidirectional attention on the input and causal attention on the output—exactly the split you might have guessed. But the dominant chatbot architecture today (called "decoder-only") keeps it simple: causal all the way through. And it works.

Three architectures: Encoder-only models (BERT) use bidirectional attention and are built to understand text. Decoder-only models (GPT, Claude, Llama) use causal attention for everything—both understanding and generating. Encoder-decoder models (T5, the original Transformer) use bidirectional on the input and causal on the output. Most chatbots today are decoder-only.
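The difference between the two attention styles comes down to which positions each token is allowed to see. A rough sketch, using hypothetical helper names, draws that as a grid where each row is a token and each column is a token it might attend to:

```python
def bidirectional_mask(n):
    """Every position may attend to every position (BERT-style)."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Position `row` may attend only to positions at or before it (GPT-style)."""
    return [[col <= row for col in range(n)] for row in range(n)]

def render(mask):
    """Draw a mask as text: x = may attend, . = hidden."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)

print("causal:")
print(render(causal_mask(4)))
print("bidirectional:")
print(render(bidirectional_mask(4)))
```

In these terms, an encoder-only model uses the full grid, a decoder-only model uses the triangle for everything, and an encoder-decoder model uses the full grid on its input and the triangle on its output.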

What is a context window?

The context window is the maximum number of tokens a model can see at once. Think of it as the model's working memory—everything it can hold in mind when generating its next response.

When you talk to an AI chatbot, the entire conversation history gets packed into this window. Your first message, the AI's reply, your follow-up, system instructions—all of it consumes tokens in a fixed-size window.

Context window sizes: GPT-3 had a 2,048-token context window (roughly 1,500 words). GPT-4 launched at 8,192 tokens, later expanding to 128,000 with GPT-4 Turbo. Claude can handle 200,000+. Bigger windows mean the model can hold more in mind at once—reading longer documents, maintaining longer conversations, considering more context.

When a conversation exceeds the context window, something has to go. Typically the oldest messages are dropped, so they disappear from the model's view. This is why chatbots "forget" things you said earlier in a long conversation. It isn't forgetting in the human sense: the information has scrolled out of the window.
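A simple version of this truncation can be sketched as follows. The assumptions are loud ones: `count_tokens` is a crude stand-in that counts whitespace-separated words (real tokenizers split text differently), and `fit_to_window` is a hypothetical helper that drops the oldest messages first. Actual chat products may handle overflow in other ways.

```python
def count_tokens(text):
    """Crude stand-in: one token per whitespace-separated word."""
    return len(text.split())

def fit_to_window(messages, max_tokens):
    """Keep the most recent messages whose combined token count fits the window."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older falls out of view
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order

conversation = [
    "My name is Alice",              # 4 tokens
    "Nice to meet you Alice",        # 5 tokens
    "What is attention",             # 3 tokens
    "It weighs word relationships",  # 4 tokens
]

visible = fit_to_window(conversation, max_tokens=8)
print(visible)
```

With an 8-token window only the last two messages survive, so a follow-up question like "What is my name?" would arrive with no trace of the answer in view.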

Multi-head attention: looking for different things at once

Real models don't run just one attention pattern. They run dozens in parallel, called attention heads. Each head learns to look for different types of relationships:

One head might focus on grammatical structure—connecting verbs to their subjects.
Another might track entity references—connecting "she" back to "Alice."
Another might pick up semantic similarity—connecting related concepts.
Some heads seem to serve no obvious purpose, yet removing them can still hurt performance.

These perspectives get combined. The model builds a layered picture of how each word relates to every other word, all computed at the same time.
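Here is a stripped-down sketch of how several heads might run side by side. It assumes each head works on its own slice of the Query, Key, and Value vectors and that the head outputs are joined end to end. Real models also apply learned projections per head and a final learned output matrix, both omitted here. All numbers and helper names are invented for illustration.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def split_heads(vector, num_heads):
    """Cut one vector into `num_heads` equal slices."""
    size = len(vector) // num_heads
    return [vector[i * size:(i + 1) * size] for i in range(num_heads)]

def multi_head_attend(query, keys, values, num_heads):
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    combined = []
    for h in range(num_heads):
        q = q_heads[h]
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, k[h])) / scale for k in k_heads]
        weights = softmax(scores)  # each head gets its own attention pattern
        head_values = [v[h] for v in v_heads]
        width = len(head_values[0])
        combined.extend(
            sum(w * v[i] for w, v in zip(weights, head_values)) for i in range(width)
        )
    return combined  # head outputs joined end to end

# Two tokens, 4-number vectors, 2 heads. Head 0 is tuned to favor token 0, head 1 token 1.
query = [4.0, 0.0, 0.0, 4.0]
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
values = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]

output = multi_head_attend(query, keys, values, num_heads=2)
print([round(x, 3) for x in output])
```

The first half of the output comes from the head that attends mostly to token 0, and the second half from the head that attends mostly to token 1, even though both heads saw the same two tokens.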

Layers: attention on top of attention

A transformer doesn't compute attention just once. It stacks dozens of layers (more than a hundred in some of the largest models), each building on the one before. Early layers tend to capture simple patterns such as grammar and local word relationships. Deeper layers tend to pick up increasingly abstract things like sentiment, argument structure, and factual associations.

By the time information flows through all these layers, the model has built a detailed representation of the input. Each word's embedding has been enriched with context from the other words it can attend to, through multiple rounds of attention at multiple levels of abstraction.
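Stacking can be pictured as feeding the output of one attention step into the next. The sketch below is heavily simplified and should be read as an illustration only: each layer uses the vectors themselves as Query, Key, and Value (no learned weights), every token can see every other token, and the other components of a real transformer layer are left out. The helper names and numbers are invented.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_layer(vectors):
    """One simplified layer: each vector becomes a weighted average of all the vectors."""
    scale = math.sqrt(len(vectors[0]))
    mixed = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        mixed.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(len(query))]
        )
    return mixed

def run_layers(vectors, num_layers):
    """Feed each layer's output into the next layer."""
    for _ in range(num_layers):
        vectors = self_attention_layer(vectors)
    return vectors

# Three tokens with invented 2-number embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

after_three = run_layers(tokens, num_layers=3)
print([[round(x, 3) for x in vec] for vec in after_three])
```

The first token starts with a zero in its second slot. After passing through the layers that slot is no longer zero, because information from the other tokens has been mixed in at every round.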

The big picture: The transformer architecture—attention plus stacked layers—is why modern AI feels qualitatively different from earlier systems. It can hold entire documents in mind, track complex relationships across thousands of words, and build layered understanding from surface patterns to abstract meaning. All from the same core mechanism: letting every word look at every other word.

We've covered how text becomes numbers (Chapter 1), how the model learns patterns (Chapter 2), and how attention lets it consider context (this chapter). One question left: if the model is "just" predicting the next word, how does it reason, create, and solve problems?