The problem attention was built to solve
Before 2017, the dominant architecture for sequence modelling was recurrent: LSTMs, GRUs, and their variants processed a sentence one token at a time, folding every prior token's influence into a single evolving hidden state. This worked — famously well on translation and speech — but suffered from two deep problems. Sequential processing meant you could not parallelize along the sequence axis; training time was bound by sequence length, not by FLOPs. And information dilution meant that a token a thousand steps ago had to survive a thousand subsequent updates to the hidden state, which in practice it rarely did.
Attention throws the recurrent contract out. Instead of forcing every token to remember every other token, it lets each token look up the tokens it needs, directly. Whatever the lookup returns is what the token receives. The thing being looked up is a key; the thing doing the looking is a query; the thing returned is a value.
On the right, six tokens — the, cat, sat, on, mat, quietly — sit scattered in a tiny 2-dimensional space. The copper dot labelled $q$ is the query. Drag it around. Notice how the arrows redraw live. Every arrow's thickness is the attention weight that this query is assigning to that key.
Dot product = alignment
For each key token $k_i$, we compute a raw attention score. The score for key $k_i$ is the dot product of the query with the key:

$$s_i = q \cdot k_i$$
Look at what happens as you move $q$:
- When $q$ points in the same direction as $k_i$, the score is big and positive.
- When $q$ is perpendicular to $k_i$, the score is roughly zero.
- When it points the opposite way, the score is negative.
This is not a metaphor. The dot product is the measurement of alignment between two vectors.
Fix the lengths of $q$ and $k_i$ and the dot product becomes a scaled cosine similarity. An attention score is, at its core, a statement about how much two directions in representation space agree.
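This is easy to check numerically. A minimal sketch with made-up vectors (not the widget's actual coordinates):

```python
import numpy as np

# Hypothetical query and keys in a 2-D space, chosen for illustration.
q = np.array([1.0, 0.5])
keys = {
    "same": np.array([2.0, 1.0]),    # same direction as q
    "perp": np.array([-0.5, 1.0]),   # perpendicular to q
    "opp":  np.array([-1.0, -0.5]),  # opposite direction
}

for name, k in keys.items():
    score = q @ k                                        # raw dot-product score
    cos = score / (np.linalg.norm(q) * np.linalg.norm(k))  # cosine similarity
    print(f"{name}: score={score:+.3f}  cos={cos:+.3f}")
    # same: positive, perp: ~0, opp: negative
```

With the lengths fixed, the score and the cosine move together: the dot product is cosine similarity times the two norms.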
Why we divide by $\sqrt{d_k}$
Here is where first-principles thinking pays off. In low dimension, raw dot products behave nicely. In high-dimensional attention, they can get enormous. If you push large scores through a softmax, the softmax becomes extremely peaked — one token gets nearly all the probability and everything else gets vanishingly small gradients. Training stalls.
The fix from Vaswani et al. is to divide every score by $\sqrt{d_k}$ before the softmax, where $d_k$ is the key dimensionality. Where does $\sqrt{d_k}$ come from? Not magic — it falls out of a standard variance calculation: if the components of $q$ and $k$ are independent with zero mean and unit variance, the dot product $q \cdot k$ is a sum of $d_k$ such terms and has variance $d_k$, so dividing by $\sqrt{d_k}$ restores unit variance.
Here in our 2-dimensional playground, $d_k = 2$, so the scale factor is a modest $\sqrt{2} \approx 1.41$. For standard transformer head widths $d_k = 64$, it's $\sqrt{64} = 8$ — much bigger. But the principle is identical: normalize away the dimension dependence.
Scores grow with head dimension. At $d_k = 128$, raw scores are eight times larger than at $d_k = 2$. The softmax becomes a one-hot spike on the argmax. Gradients vanish on all non-argmax tokens. Deep transformers cannot learn to distribute attention.
Scaled scores have unit variance regardless of $d_k$. The softmax produces a usable distribution for any head size. The model can spread attention over many tokens when it needs to. Training works.
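The variance calculation can be verified empirically. A sketch that samples random queries and keys at several head widths (the sample counts and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# With unit-variance components, Var(q . k) = d_k, so std grows as sqrt(d_k).
for d_k in (2, 64, 128):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    raw = (q * k).sum(axis=1)        # 10k raw dot-product scores
    scaled = raw / np.sqrt(d_k)      # the Vaswani et al. fix
    print(f"d_k={d_k:3d}  raw std={raw.std():6.2f}  scaled std={scaled.std():.2f}")
    # raw std tracks sqrt(d_k); scaled std stays near 1 for every d_k
```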
Softmax: from scores to a probability distribution
Now we turn scaled scores into a probability distribution over the keys. The softmax function does exactly that:

$$a_i = \frac{\exp\!\left(s_i/\sqrt{d_k}\right)}{\sum_j \exp\!\left(s_j/\sqrt{d_k}\right)}$$
Three properties worth memorising:
- Every $a_i$ is between 0 and 1.
- They sum to exactly 1 — it's a distribution.
- Softmax is monotonic and smooth: a higher score means a higher weight, and it is infinitely differentiable everywhere.
The weights $a_i$ are now on screen. Keys that $q$ is pointing toward grab most of the probability mass, and everything else gets a little. Drag $q$ and watch the bar chart at the bottom — the bars redistribute continuously.
Weighted sum of values — the actual layer output
The last step is almost anticlimactic. For each key, there is a corresponding value vector $v_i$. The output of the attention layer for this query is the attention-weighted average of the values:

$$\mathrm{out} = \sum_i a_i \, v_i$$
In the diagram we've simplified for pedagogical clarity — we're treating $v_i = k_i$, so the output vector is literally a weighted average of the key positions. You can see it now as a teal dot connected to the origin by a dashed line. Drag $q$ and the output vector smoothly sweeps toward whichever keys are grabbing most of the attention mass.
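The whole single-query pipeline — scores, scaling, softmax, weighted sum — fits in a few lines. A sketch using six made-up key positions (not the widget's coordinates) and the same $v_i = k_i$ simplification:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Six hypothetical 2-D key positions, one per token.
keys = np.array([[ 0.9,  0.2], [ 0.1,  0.8], [-0.6,  0.5],
                 [-0.8, -0.3], [ 0.4, -0.7], [ 0.7,  0.6]])
values = keys                    # pedagogical simplification: v_i = k_i
q = np.array([1.0, 0.3])         # the draggable query

d_k = q.shape[0]
weights = softmax(keys @ q / np.sqrt(d_k))  # scaled scores -> distribution
output = weights @ values                   # attention-weighted average
print(weights.round(3), output.round(3))
```

Because the weights are a convex combination, the output always lands inside the hull of the value positions — which is why the teal dot sweeps between keys rather than overshooting them.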
The whole formula, in one line
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

That is the most famous equation in deep learning since 2017. Every component you just saw is in it: the dot products in $QK^\top$, the $\sqrt{d_k}$ scaling, the row-wise softmax, and the weighted sum with the values $V$.
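The formula translates almost directly into NumPy. A minimal single-head sketch (the function name and shapes are ours, not from any library):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — one query per row of Q."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 4))
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 4))
print(attention(Q, K, V).shape)  # (6, 4): one output vector per query
```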
The attention matrix
Every transformer paper draws this picture. Most tutorials just show it — rarely do they let you actually play with it. Here is a 6-token sentence, the cat sat on the mat, with its full attention matrix. Every row is a query asking “who should I attend to?”; every column is a key saying “you can look at me”; every cell is the attention weight that this query assigns to that key. Click any row to highlight its query, toggle between raw scores and softmax-normalised probabilities, and switch the causal mask on or off.
Three things worth noticing
- Causal masking is just a triangle. When you turn the mask on, the upper-right triangle gets blanked out — a token cannot attend to tokens that come after it. In software, this is implemented by adding $-\infty$ to the scores of masked positions before softmax, so those cells become zero after normalisation.
- Rows always sum to 1 (when softmax is on). Each row is its own probability distribution over which keys it attends to, independent of the other rows. The softmax is applied row-wise, not over the whole matrix.
- The actual patterns are learned. Notice that sat (row 3) attends heavily to cat (column 2). That's a subject-verb relationship, and real trained attention heads learn exactly this kind of pattern without being told to. The hand-crafted scores above are just a plausible imitation — real ones are more structured and more interesting.
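The first bullet's $-\infty$ trick can be sketched directly — random scores stand in for the hand-crafted ones, and `exp(-inf)` conveniently evaluates to exactly zero:

```python
import numpy as np

n = 6
scores = np.random.default_rng(0).standard_normal((n, n))

# Causal mask: -inf strictly above the diagonal, so softmax zeroes those cells.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
print(weights.round(2))  # upper-right triangle is 0; every row sums to 1
```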
Beyond the formula — the parts most tutorials skip
Some explanations hand-wave “we divide by $\sqrt{d_k}$ to prevent the softmax from saturating.” That's true but imprecise. The real reason is more subtle.
Consider the gradient of the softmax with respect to one of its inputs:

$$\frac{\partial a_i}{\partial s_j} = a_i \left(\delta_{ij} - a_j\right)$$
When the softmax is peaked (some $a_k \approx 1$), all other $a_i \approx 0$, so $\partial a_i / \partial s_j \approx 0$ for most pairs. The gradient to non-argmax tokens is essentially zero. The model cannot learn from those tokens — it's stuck attending to whatever it happened to land on at init.
$\sqrt{d_k}$ scaling keeps the softmax in a regime where non-argmax weights are nonzero and gradients can flow through multiple tokens. It's a training-dynamics fix, not just a numerical one.
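You can watch the collapse numerically. A sketch of the softmax Jacobian above, where multiplying the same scores by 10 stands in for what unscaled dot products do at large $d_k$ (the score vector is made up for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    a = softmax(z)
    # da_i/ds_j = a_i (delta_ij - a_j), as a full matrix
    return np.diag(a) - np.outer(a, a)

z = np.array([1.0, 0.5, -0.2, 0.1])
for scale in (1.0, 10.0):            # well-scaled vs. blown-up scores
    J = softmax_jacobian(z * scale)
    print(scale, np.abs(J).max())    # peaked softmax -> every entry collapses
```

At scale 1 the largest Jacobian entry is of order 0.2; at scale 10 the softmax is nearly one-hot and every entry shrinks toward zero, which is exactly the stalled-gradient regime described above.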