Attention, Softmax, and Weighted Mixtures
attention, softmax, weighted mixtures, transformers, scaled dot-product attention
1 Application Snapshot
Modern attention layers look complicated when drawn as full architecture diagrams, but one core operation is very simple:
assign a score to each value vector by comparing a query against its key, normalize the scores with softmax, and take the weighted sum of the values
That means attention is not a completely new mathematical species. It is a structured weighted mixture.
2 Problem Setting
Suppose one query vector \(q\) looks at keys \(k_1,\dots,k_m\) and values \(v_1,\dots,v_m\).
Scaled dot-product attention computes scores
\[ s_i = \frac{q^\top k_i}{\sqrt{d_k}}, \]
turns them into weights with softmax,
\[ \alpha_i = \frac{e^{s_i}}{\sum_{j=1}^m e^{s_j}}, \]
and then returns the attention output
\[ z = \sum_{i=1}^m \alpha_i v_i. \]
So the flow is:
- compare query and keys
- normalize scores into nonnegative weights
- mix the value vectors using those weights
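In code, those three steps fit in a handful of lines. Here is a minimal NumPy sketch for a single query; the function name and argument shapes are illustrative choices, not taken from any particular library.

```python
import numpy as np

def single_query_attention(q, K, V):
    """Scaled dot-product attention for one query.

    q: (d_k,) query vector
    K: (m, d_k) key vectors, one per row
    V: (m, d_v) value vectors, one per row
    Returns z of shape (d_v,), a weighted mixture of the rows of V.
    """
    d_k = K.shape[1]
    scores = K @ q / np.sqrt(d_k)            # s_i = q^T k_i / sqrt(d_k)
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights = weights / weights.sum()        # softmax: nonnegative, sums to one
    return weights @ V                       # z = sum_i alpha_i v_i
```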
3 Why This Math Appears
This page sits at the intersection of several pages already on the site:
- Vector Mixtures in Embeddings and Attention: the output is still a weighted vector sum
- Learned Linear Projections in Transformers: the queries, keys, and values usually come from learned linear maps
- Backpropagation and Computation Graphs: the score, softmax, and weighted sum all sit inside a differentiable computation graph
So attention is best understood as:
- a scoring rule
- a normalization rule
- a weighted-mixture rule
rather than as a single mysterious black box.
4 Math Objects In Use
- query vector \(q\)
- key vectors \(k_i\)
- value vectors \(v_i\)
- similarity scores \(s_i\)
- softmax weights \(\alpha_i\)
- attention output \(z\)
5 A Small Worked Walkthrough
Take one query and three key-value pairs:
\[ q = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad k_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad k_2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad k_3 = \begin{bmatrix} -1 \\ 0 \end{bmatrix}. \]
The keys here are two-dimensional, so \(d_k = 2\), but the \(1/\sqrt{d_k}\) factor only rescales every score by the same constant. To keep the arithmetic clean, drop the scaling and use the raw dot products as scores. Then the scores are
\[ s_1 = q^\top k_1 = 1, \qquad s_2 = q^\top k_2 = 0, \qquad s_3 = q^\top k_3 = -1. \]
Softmax turns these into weights
\[ \alpha = \operatorname{softmax}(1,0,-1) \approx (0.665,\;0.245,\;0.090). \]
Now choose values
\[ v_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad v_3 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}. \]
Then the output is
\[ z = 0.665\,v_1 + 0.245\,v_2 + 0.090\,v_3 \approx \begin{bmatrix} 0.755 \\ 0.335 \end{bmatrix}. \]
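The same numbers can be checked with a few lines of NumPy (a quick verification of the arithmetic above, with the scaling dropped as in the text):

```python
import numpy as np

s = np.array([1.0, 0.0, -1.0])                       # scores from the dot products
alpha = np.exp(s) / np.exp(s).sum()                  # softmax weights
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # v_1, v_2, v_3 as rows

z = alpha @ V                                        # weighted mixture of the value rows
print(alpha.round(3))   # [0.665 0.245 0.09 ]
print(z.round(3))       # [0.755 0.335]
```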
This example shows three useful facts:
- the output is still a linear combination of value vectors
- because the softmax weights are nonnegative and sum to one, the output is a convex combination
- changing the scores changes the mixture, not the basic algebraic form
6 Implementation or Computation Note
For many queries at once, attention is usually written in matrix form:
\[ A = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right), \qquad Z = AV. \]
This compact formula hides the same story:
- \(QK^\top\) computes all query-key scores
- softmax converts each row into normalized weights
- multiplication by \(V\) forms weighted mixtures of the value vectors
In real transformers, masking is also crucial. It prevents attention from using padded positions or future tokens that should not be visible yet.
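Here is a sketch of the batched formula, with an optional boolean mask applied before the softmax: blocked positions get a score of negative infinity, so they receive exactly zero weight. As above, the names and shapes are illustrative assumptions rather than any specific framework's API.

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Z = softmax(Q K^T / sqrt(d_k)) V, computed row by row.

    Q: (n, d_k) queries, K: (m, d_k) keys, V: (m, d_v) values.
    mask: optional (n, m) boolean array; False entries cannot receive weight.
    """
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)               # all query-key scores at once
    if mask is not None:
        S = np.where(mask, S, -np.inf)       # blocked scores become -inf
    S = S - S.max(axis=-1, keepdims=True)    # stabilize before exponentiating
    A = np.exp(S)
    A = A / A.sum(axis=-1, keepdims=True)    # each row is a softmax over the keys
    return A @ V                             # each output row is a weighted mixture

# Example causal mask: position i may only attend to positions j <= i.
# causal = np.tril(np.ones((n, n), dtype=bool))
```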
7 Failure Modes
- reading attention as if it created arbitrary new directions, when one head still outputs a mixture of existing value vectors
- forgetting that the weights come from learned projections, not raw tokens alone
- treating softmax weights as a direct explanation of causal importance
- ignoring masking, which changes which positions can actually receive weight
- confusing the attention mechanism itself with the entire transformer block around it
8 Paper Bridge
- Attention Is All You Need - Paper bridge - the canonical source for scaled dot-product attention and the matrix form used in transformers. Checked 2026-04-24.
- 11.3 Attention Scoring Functions - First pass - open technical chapter that isolates the scoring, softmax, and weighted-sum pieces clearly. Checked 2026-04-24.
9 Sources and Further Reading
- 11.1 Queries, Keys, and Values - First pass - clear open introduction to the query-key-value picture before the full transformer block. Checked 2026-04-24.
- 11.3 Attention Scoring Functions - First pass - open chapter showing how similarity scores, masked softmax, and weighted averages fit together. Checked 2026-04-24.
- Attention Is All You Need - Paper bridge - original transformer paper with the scaled dot-product attention equation. Checked 2026-04-24.
- CME 295: Transformers & Large Language Models - Second pass - current official Stanford course hub for readers who want a broader transformer/LLM path after the algebraic core is clear. Checked 2026-04-24.