Paper Lab: Attention as Weighted Vector Mixture

A guided reading page for seeing the linear-combination heart of transformer attention inside the original paper.
Modified: April 26, 2026

Keywords: paper reading, attention, vectors, embeddings

1 Why This Paper

Use this paper lab when you want your first research-facing example of a famous ML paper whose core linear-algebra move is still simple:

build one vector by weighting and summing other vectors.

The anchor paper is "Attention Is All You Need" (Vaswani et al., 2017), the original transformer paper.

2 What To Know First

  • what a linear combination is
  • how a matrix can store a collection of vectors
  • why changing weights changes the output without changing the underlying span

3 First Pass

On a first pass, ignore most architecture details and track only one equation:

\[ \sum_i \alpha_i v_i. \]

That is the attention output for one query: a weighted sum of value vectors.

The main story, sketched in code right after this list, is:

  1. query-key similarity scores, pushed through a softmax, produce the weights \(\alpha_i\)
  2. the weights choose how much each value vector matters
  3. the output representation lies in the span of the value vectors
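A minimal numeric sketch of those three steps, assuming NumPy; the specific vectors and the three-dimensional size are made-up illustrative choices, not values from the paper:

```python
import numpy as np

# Made-up query, keys, and values in R^3; one key/value per row.
q = np.array([1.0, 0.0, 1.0])
K = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
V = np.array([[2.0, 0.0, 0.0],
              [0.0, 3.0, 0.0],
              [1.0, 1.0, 1.0]])

# 1. similarity scores: scaled dot products of the query with each key
scores = K @ q / np.sqrt(q.size)

# 2. weights: a softmax turns the scores into nonnegative weights summing to 1
alpha = np.exp(scores) / np.exp(scores).sum()

# 3. output: the weighted sum sum_i alpha_i v_i, a point in span{v_1, v_2, v_3}
output = alpha @ V
print(alpha, output)
```

Changing the weights moves the output around inside the same span, which is exactly what the third step of the story claims.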

4 Second Pass

The real mathematical objects to track are:

  • query vector \(q\)
  • key vectors \(k_i\)
  • value vectors \(v_i\)
  • weights \(\alpha_i\)
  • output vector \(\sum_i \alpha_i v_i\)

At this pass, separate two roles:

  • linear map role: learned projections create queries, keys, and values
  • vector-mixture role: the final output is a weighted combination of value vectors

That separation helps you see which part belongs to the Vectors topic and which part belongs more naturally to Matrices and Linear Maps.
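A sketch that keeps the two roles visibly separate, assuming NumPy; the names `W_q`, `W_k`, `W_v` and the sizes are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_head = 5, 8, 4              # illustrative sizes
X = rng.normal(size=(n, d_model))         # n token embeddings

# Linear-map role: learned projections produce queries, keys, and values.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Vector-mixture role: each output row is a weighted combination of the rows of V.
scores = Q @ K.T / np.sqrt(d_head)
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
out = alpha @ V                            # out[i] = sum_j alpha[i, j] * V[j]
```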

5 Math Dependency Map

Read this page after the Vectors topic, in particular linear combinations, span, and how a matrix stores a collection of vectors; the projection step also draws on Matrices and Linear Maps.

6 Key Claims and Evidence

The paper’s main architectural claim is not itself a linear-algebra theorem.

But the linear-algebra object inside the architecture is very clean:

  • the output is a weighted sum of value vectors
  • the learned projections are matrix maps
  • multi-head attention repeats this pattern in parallel subspaces

The evidence in the paper is mainly experimental and architectural, not theorem-based.
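A sketch of the third bullet's parallel-subspace pattern, assuming NumPy; the head count, the dimensions, and the shortcut of skipping the per-head projections are illustrative assumptions, not the paper's exact layout:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_heads, d_head = 6, 2, 3              # illustrative sizes

# Pretend per-head projections have already been applied: (heads, tokens, d_head).
Q = rng.normal(size=(n_heads, n, d_head))
K = rng.normal(size=(n_heads, n, d_head))
V = rng.normal(size=(n_heads, n, d_head))

heads = []
for h in range(n_heads):
    # Each head repeats the same weighted-mixture pattern in its own subspace.
    scores = Q[h] @ K[h].T / np.sqrt(d_head)
    alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    heads.append(alpha @ V[h])

# Head outputs are concatenated along the feature axis (the paper then re-projects them).
out = np.concatenate(heads, axis=-1)       # shape (n, n_heads * d_head)
```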

7 What To Reproduce

A good reproduction target is tiny:

  1. choose three value vectors in \(\mathbb{R}^d\)
  2. choose one query and three keys
  3. compute attention weights
  4. form the weighted sum explicitly
  5. verify that the output lies in the span of the values

That reproduction target is small, but it teaches the main bridge idea.
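A minimal sketch of that reproduction, assuming NumPy; the specific vectors and \(d = 4\) are made up, and the span check in step 5 uses a least-squares fit, which is one convenient choice rather than anything prescribed by the paper:

```python
import numpy as np

d = 4
# Steps 1-2: three value vectors in R^4, one query, three keys (made-up numbers).
V = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 2.0]])
q = np.array([1.0, 1.0, 0.0, 0.0])
K = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, 1.0]])

# Step 3: attention weights from scaled dot products and a softmax.
scores = K @ q / np.sqrt(d)
alpha = np.exp(scores) / np.exp(scores).sum()

# Step 4: form the weighted sum explicitly.
output = sum(a * v for a, v in zip(alpha, V))

# Step 5: the output lies in span{v_1, v_2, v_3}, so a least-squares fit
# of the output onto the value vectors reconstructs it exactly.
coeffs, *_ = np.linalg.lstsq(V.T, output, rcond=None)
assert np.allclose(V.T @ coeffs, output)
print(alpha, output, coeffs)
```

Because the three value vectors here are linearly independent, the recovered coefficients match the attention weights, which makes the bridge from weights to output concrete.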

8 What Has Changed Since Publication

Since the original paper, attention has branched into:

  • larger transformer architectures
  • efficient or sparse attention variants
  • graph and multimodal attention mechanisms

But the weighted-vector-mixture view still survives as a first reading lens.

9 Sources and Further Reading
