The problem attention was built to solve
Before 2017, the dominant architecture for sequence modelling was recurrent: LSTMs, GRUs, and their variants processed a sentence one token at a time, folding every prior token's influence into a single evolving hidden state. This worked — famously well on translation and speech — but suffered from two deep problems. Sequential processing meant you could not parallelize along the sequence axis; training time was bound by sequence length, not by FLOPs. And information dilution meant that a token a thousand steps ago had to survive a thousand subsequent updates to the hidden state, which in practice it rarely did.
Attention throws the recurrent contract out. Instead of forcing every token to remember every other token, it lets each token look up the tokens it needs, directly. Whatever the lookup returns is what the token receives. The thing being looked up is a key; the thing doing the looking is a query; the thing returned is a value.
On the right, six tokens — the, cat, sat, on, mat, quietly — sit scattered in a tiny 2-dimensional space. The copper dot labelled $q$ is the query. Drag it around. Notice how the arrows redraw live. Every arrow's thickness is the attention weight that this query is assigning to that key.
Dot product = alignment
For each key token $k_i$, we compute a raw attention score. The score for key $k_i$ is the dot product of the query with the key:

$$s_i = q \cdot k_i$$
Look at what happens as you move $q$:
- When $q$ points in the same direction as $k_i$, the score is big and positive.
- When $q$ is perpendicular to $k_i$, the score is roughly zero.
- When it points the opposite way, the score is negative.
This is not a metaphor. The dot product is the measurement of alignment between two vectors.
Fix the lengths of $q$ and $k_i$ and the dot product becomes a scaled cosine similarity. An attention score is, at its core, a statement about how much two directions in representation space agree.
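This is easy to check numerically. A minimal sketch with made-up vectors (not the widget's actual coordinates):

```python
import numpy as np

# Hypothetical query and keys in a 2-D space, chosen for illustration.
q = np.array([1.0, 0.5])
keys = {
    "same": np.array([2.0, 1.0]),    # same direction as q
    "perp": np.array([-0.5, 1.0]),   # perpendicular to q
    "opp":  np.array([-1.0, -0.5]),  # opposite direction
}

for name, k in keys.items():
    score = q @ k                                        # raw dot-product score
    cos = score / (np.linalg.norm(q) * np.linalg.norm(k))  # cosine similarity
    print(f"{name}: score={score:+.3f}  cos={cos:+.3f}")
    # same: positive, perp: ~0, opp: negative
```

With the lengths fixed, the score and the cosine move together: the dot product is cosine similarity times the two norms.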
Why we divide by $\sqrt{d_k}$
Here is where first-principles thinking pays off. In low dimension, raw dot products behave nicely. In high-dimensional attention, they can get enormous. If you push large scores through a softmax, the softmax becomes extremely peaked — one token gets nearly all the probability and everything else gets vanishingly small gradients. Training stalls.
The fix from Vaswani et al. is to divide every score by $\sqrt{d_k}$ before the softmax, where $d_k$ is the key dimensionality. Where does $\sqrt{d_k}$ come from? Not magic — it falls out of a standard variance calculation: if the components of $q$ and $k$ are independent with zero mean and unit variance, the dot product $q \cdot k$ is a sum of $d_k$ such terms and has variance $d_k$, so dividing by $\sqrt{d_k}$ restores unit variance.
Here in our 2-dimensional playground, $d_k = 2$, so the scale factor is a modest $\sqrt{2} \approx 1.41$. For standard transformer head widths $d_k = 64$, it's $\sqrt{64} = 8$ — much bigger. But the principle is identical: normalize away the dimension dependence.
Scores grow with head dimension. At $d_k = 128$, raw scores are eight times larger than at $d_k = 2$. The softmax becomes a one-hot spike on the argmax. Gradients vanish on all non-argmax tokens. Deep transformers cannot learn to distribute attention.
Scaled scores have unit variance regardless of $d_k$. The softmax produces a usable distribution for any head size. The model can spread attention over many tokens when it needs to. Training works.
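The variance calculation can be verified empirically. A sketch that samples random queries and keys at several head widths (the sample counts and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# With unit-variance components, Var(q . k) = d_k, so std grows as sqrt(d_k).
for d_k in (2, 64, 128):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    raw = (q * k).sum(axis=1)        # 10k raw dot-product scores
    scaled = raw / np.sqrt(d_k)      # the Vaswani et al. fix
    print(f"d_k={d_k:3d}  raw std={raw.std():6.2f}  scaled std={scaled.std():.2f}")
    # raw std tracks sqrt(d_k); scaled std stays near 1 for every d_k
```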
Softmax: from scores to a probability distribution
Now we turn scaled scores into a probability distribution over the keys. The softmax function does exactly that:

$$a_i = \frac{\exp\!\left(s_i/\sqrt{d_k}\right)}{\sum_j \exp\!\left(s_j/\sqrt{d_k}\right)}$$
Three properties worth memorising:
- Every $a_i$ is between 0 and 1.
- They sum to exactly 1 — it's a distribution.
- Softmax is monotonic and smooth: a higher score means a higher weight, and it is infinitely differentiable everywhere.
The weights $a_i$ are now on screen. Keys that $q$ is pointing toward grab most of the probability mass, and everything else gets a little. Drag $q$ and watch the bar chart at the bottom — the bars redistribute continuously.
Weighted sum of values — the actual layer output
The last step is almost anticlimactic. For each key, there is a corresponding value vector $v_i$. The output of the attention layer for this query is the attention-weighted average of the values:

$$\mathrm{out} = \sum_i a_i \, v_i$$
In the diagram we've simplified for pedagogical clarity — we're treating $v_i = k_i$, so the output vector is literally a weighted average of the key positions. You can see it now as a teal dot connected to the origin by a dashed line. Drag $q$ and the output vector smoothly sweeps toward whichever keys are grabbing most of the attention mass.
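The whole single-query pipeline — scores, scaling, softmax, weighted sum — fits in a few lines. A sketch using six made-up key positions (not the widget's coordinates) and the same $v_i = k_i$ simplification:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Six hypothetical 2-D key positions, one per token.
keys = np.array([[ 0.9,  0.2], [ 0.1,  0.8], [-0.6,  0.5],
                 [-0.8, -0.3], [ 0.4, -0.7], [ 0.7,  0.6]])
values = keys                    # pedagogical simplification: v_i = k_i
q = np.array([1.0, 0.3])         # the draggable query

d_k = q.shape[0]
weights = softmax(keys @ q / np.sqrt(d_k))  # scaled scores -> distribution
output = weights @ values                   # attention-weighted average
print(weights.round(3), output.round(3))
```

Because the weights are a convex combination, the output always lands inside the hull of the value positions — which is why the teal dot sweeps between keys rather than overshooting them.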
The whole formula, in one line
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

That is the most famous equation in deep learning since 2017. Every component you just saw is in it: the dot products in $QK^\top$, the $\sqrt{d_k}$ scaling, the row-wise softmax, and the weighted sum with the values $V$.
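The formula translates almost directly into NumPy. A minimal single-head sketch (the function name and shapes are ours, not from any library):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — one query per row of Q."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 4))
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 4))
print(attention(Q, K, V).shape)  # (6, 4): one output vector per query
```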
The attention matrix
Every transformer paper draws this picture. Most tutorials just show it — rarely do they let you actually play with it. Here is a 6-token sentence, the cat sat on the mat, with its full attention matrix. Every row is a query asking “who should I attend to?”; every column is a key saying “you can look at me”; every cell is the attention weight that this query assigns to that key. Click any row to highlight its query, toggle between raw scores and softmax-normalised probabilities, and switch the causal mask on or off.
Three things worth noticing
- Causal masking is just a triangle. When you turn the mask on, the upper-right triangle gets blanked out — a token cannot attend to tokens that come after it. In software, this is implemented by adding $-\infty$ to the scores of masked positions before softmax, so those cells become zero after normalisation.
- Rows always sum to 1 (when softmax is on). Each row is its own probability distribution over which keys it attends to, independent of the other rows. The softmax is applied row-wise, not over the whole matrix.
- The actual patterns are learned. Notice that sat (row 3) attends heavily to cat (column 2). That's a subject-verb relationship, and real trained attention heads learn exactly this kind of pattern without being told to. The hand-crafted scores above are just a plausible imitation — real ones are more structured and more interesting.
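The first bullet's $-\infty$ trick can be sketched directly — random scores stand in for the hand-crafted ones, and `exp(-inf)` conveniently evaluates to exactly zero:

```python
import numpy as np

n = 6
scores = np.random.default_rng(0).standard_normal((n, n))

# Causal mask: -inf strictly above the diagonal, so softmax zeroes those cells.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
print(weights.round(2))  # upper-right triangle is 0; every row sums to 1
```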
Beyond the formula — the parts most tutorials skip
Some explanations hand-wave “we divide by $\sqrt{d_k}$ to prevent the softmax from saturating.” That's true but imprecise. The real reason is more subtle.
Consider the gradient of the softmax with respect to one of its inputs:

$$\frac{\partial a_i}{\partial s_j} = a_i \left(\delta_{ij} - a_j\right)$$
When the softmax is peaked (some $a_k \approx 1$), all other $a_i \approx 0$, so $\partial a_i / \partial s_j \approx 0$ for most pairs. The gradient to non-argmax tokens is essentially zero. The model cannot learn from those tokens — it's stuck attending to whatever it happened to land on at init.
$\sqrt{d_k}$ scaling keeps the softmax in a regime where non-argmax weights are nonzero and gradients can flow through multiple tokens. It's a training-dynamics fix, not just a numerical one.
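You can watch the collapse numerically. A sketch of the softmax Jacobian above, where multiplying the same scores by 10 stands in for what unscaled dot products do at large $d_k$ (the score vector is made up for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    a = softmax(z)
    # da_i/ds_j = a_i (delta_ij - a_j), as a full matrix
    return np.diag(a) - np.outer(a, a)

z = np.array([1.0, 0.5, -0.2, 0.1])
for scale in (1.0, 10.0):            # well-scaled vs. blown-up scores
    J = softmax_jacobian(z * scale)
    print(scale, np.abs(J).max())    # peaked softmax -> every entry collapses
```

At scale 1 the largest Jacobian entry is of order 0.2; at scale 10 the softmax is nearly one-hot and every entry shrinks toward zero, which is exactly the stalled-gradient regime described above.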