Paper Lab: Attention as Weighted Vector Mixture
paper reading, attention, vectors, embeddings
1 Why This Paper
Use this paper lab when you want your first research-facing example of a famous ML paper whose core linear-algebra move is still simple:
build one vector by weighting and summing other vectors.
The anchor paper is Attention Is All You Need (Vaswani et al., 2017).
2 What To Know First
- what a linear combination is
- how a matrix can store a collection of vectors
- why changing weights changes the output without changing the underlying span
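To make the last point concrete, here is a tiny worked example with two arbitrarily chosen vectors: different weights give different outputs, yet every output stays in the same span.
\[ v_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad v_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \qquad 0.9\,v_1 + 0.1\,v_2 = \begin{pmatrix} 0.9 \\ 0.1 \end{pmatrix}, \qquad 0.2\,v_1 + 0.8\,v_2 = \begin{pmatrix} 0.2 \\ 0.8 \end{pmatrix}. \]
The two results differ, but each lies in \(\operatorname{span}\{v_1, v_2\}\); only the weights changed.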
3 First Pass
On a first pass, ignore most architecture details and track only one equation:
\[ \sum_i \alpha_i v_i. \]
That is the attention output for one query: a weighted sum of value vectors.
The main story is:
- similarity scores produce weights
- weights choose how much each value vector matters
- the output representation lies in the span of the value vectors
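A minimal NumPy sketch of that story, using made-up toy vectors rather than the paper's learned projections: similarity scores become softmax weights, and the output is the weighted sum of the value rows.

```python
import numpy as np

# Toy vectors chosen for illustration: one query, three keys, three values.
q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.7, 0.7]])   # each row is a key k_i
V = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # each row is a value v_i

# similarity scores produce weights (scaled dot products, then softmax)
scores = K @ q / np.sqrt(q.size)
alpha = np.exp(scores) / np.exp(scores).sum()

# weights choose how much each value matters; the output is sum_i alpha_i v_i
output = alpha @ V
print("weights:", alpha)
print("output:", output)   # lies in the span of the rows of V
```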
4 Second Pass
The real mathematical objects to track are:
- query vector \(q\)
- key vectors \(k_i\)
- value vectors \(v_i\)
- weights \(\alpha_i\)
- output vector \(\sum_i \alpha_i v_i\)
At this pass, separate two roles:
- linear map role: learned projections create queries, keys, and values
- vector-mixture role: the final output is a weighted combination of value vectors
That separation helps you see which part belongs to the Vectors topic and which part belongs more naturally to Matrices and Linear Maps.
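A sketch that keeps the two roles visibly separate. The names, shapes, and random matrices below are illustrative stand-ins for learned weights, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, n = 8, 4, 5          # embedding dim, projection dim, sequence length

X = rng.normal(size=(n, d_model))  # input embeddings, one row per token

# Linear map role: projection matrices create queries, keys, and values.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Vector-mixture role: each output row is a weighted sum of the rows of V.
scores = Q @ K.T / np.sqrt(d_k)
alpha = np.exp(scores)
alpha /= alpha.sum(axis=1, keepdims=True)   # row-wise softmax
outputs = alpha @ V                          # shape (n, d_k)
```

The first block is pure matrix-map material; the second block never leaves the span of the rows of V.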
5 Math Dependency Map
Read this page after:
- the Vectors topic
- the Matrices and Linear Maps topic
6 Key Claims and Evidence
The paper’s main architectural claim is not itself a linear-algebra theorem.
But the linear-algebra object inside the architecture is very clean:
- the output is a weighted sum of value vectors
- the learned projections are matrix maps
- multi-head attention repeats this pattern in parallel subspaces
The evidence in the paper is mainly experimental and architectural, not theorem-based.
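The multi-head point in the list above can be made concrete with a short sketch: the same score-softmax-mixture pattern runs once per head, each in its own projected subspace. Random matrices again stand in for learned weights, and the final output projection the paper applies after concatenation is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_model, h = 4, 8, 2            # tokens, model dim, number of heads
d_k = d_model // h                  # per-head subspace dimension

X = rng.normal(size=(n, d_model))
head_outputs = []
for _ in range(h):
    # Each head gets its own projections, i.e. its own subspace to mix in.
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_k)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)
    head_outputs.append(alpha @ V)  # weighted mixture inside this head's subspace

multi_head = np.concatenate(head_outputs, axis=1)  # (n, d_model), before the output projection
```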
7 What To Reproduce
A good reproduction target is tiny:
- choose three value vectors in \(\mathbb{R}^d\)
- choose one query and three keys
- compute attention weights
- form the weighted sum explicitly
- verify that the output lies in the span of the values
That reproduction target is small, but it teaches the main bridge idea.
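A minimal sketch of that reproduction. All numbers are invented, and the span check uses a least-squares residual, which is one reasonable way to verify membership.

```python
import numpy as np

d = 4
rng = np.random.default_rng(2)

# choose three value vectors in R^d, one query, and three keys
V = rng.normal(size=(3, d))    # rows are v_1, v_2, v_3
K = rng.normal(size=(3, d))    # rows are k_1, k_2, k_3
q = rng.normal(size=d)

# compute attention weights (softmax of scaled dot products)
scores = K @ q / np.sqrt(d)
alpha = np.exp(scores)
alpha /= alpha.sum()

# form the weighted sum explicitly
output = alpha[0] * V[0] + alpha[1] * V[1] + alpha[2] * V[2]

# verify that the output lies in the span of the values:
# solve least squares for coefficients c with V.T @ c ≈ output and check the fit
c, residual, *_ = np.linalg.lstsq(V.T, output, rcond=None)
assert np.allclose(V.T @ c, output)
print("weights:", alpha)
print("recovered coefficients:", c)  # matches alpha when the values are linearly independent
```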
8 What Has Changed Since Publication
Since the original paper, attention has branched into:
- larger transformer architectures
- efficient or sparse attention variants
- graph and multimodal attention mechanisms
But the weighted-vector-mixture view still survives as a first reading lens.
9 Sources and Further Reading
- Attention Is All You Need (paper bridge). The anchor paper for this lab. Checked 2026-04-24.
- Deep learning, transformers and graph neural networks: a linear algebra perspective (second pass). Current survey context for seeing attention through a linear-algebra lens. Checked 2026-04-24.
- Vector Mixtures in Embeddings and Attention (first pass). Site page that isolates the weighted-sum core before the architecture details.