Vector Mixtures in Embeddings and Attention

A concrete application page showing how weighted vector combinations become embeddings, pooled representations, and attention outputs.
Modified

April 26, 2026

Keywords

application, vectors, embeddings, attention, weighted sum

1 Application Snapshot

One of the most common operations in modern ML is not mysterious at all:

take several vectors, weight them, and add them.

That is exactly a linear combination.

This viewpoint lets you recognize the same math inside:

  • pooled embeddings
  • mixture models of features
  • attention outputs in transformers

2 Problem Setting

Suppose we have three token or feature vectors in \(\mathbb{R}^2\):

\[ v_1 = \begin{bmatrix} 1.0 \\ 0.2 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} 0.4 \\ 1.1 \end{bmatrix}, \qquad v_3 = \begin{bmatrix} 1.2 \\ 0.8 \end{bmatrix}. \]

We want one summary representation built from these directions.

3 Why This Math Appears

If the weights are \(a_1,a_2,a_3\), then the representation

\[ z = a_1 v_1 + a_2 v_2 + a_3 v_3 \]

is a linear combination of the available vectors.

Different pipelines choose the weights differently:

  • in average pooling, they are fixed in advance
  • in learned mixtures, they are produced by another model component
  • in attention, they depend on similarity scores and are normalized before the sum is taken

The math object stays the same even when the modeling story changes.
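To make the fixed-weight case concrete, here is a minimal sketch (NumPy, reusing the vectors from the problem setting above): average pooling is this same weighted sum with every weight fixed in advance at \(1/m\).

```python
import numpy as np

# The three feature vectors from the problem setting.
v1 = np.array([1.0, 0.2])
v2 = np.array([0.4, 1.1])
v3 = np.array([1.2, 0.8])

# Average pooling: the weights are fixed in advance at 1/m.
a = np.array([1/3, 1/3, 1/3])
z_pooled = a[0] * v1 + a[1] * v2 + a[2] * v3
print(z_pooled)  # the same linear combination, just with uniform weights
```

Learned mixtures and attention only change where the weights \(a_i\) come from, not this sum.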

4 Math Objects In Use

  • embedding vectors \(v_i\)
  • scalar weights \(a_i\)
  • weighted sum \(z = \sum_i a_i v_i\)
  • span of the available representation directions

5 Worked Walkthrough

Take weights

\[ a_1 = 0.2, \qquad a_2 = 0.5, \qquad a_3 = 0.3. \]

Then

\[ z = 0.2v_1 + 0.5v_2 + 0.3v_3 \]

becomes

\[ z = 0.2 \begin{bmatrix} 1.0 \\ 0.2 \end{bmatrix} + 0.5 \begin{bmatrix} 0.4 \\ 1.1 \end{bmatrix} + 0.3 \begin{bmatrix} 1.2 \\ 0.8 \end{bmatrix} = \begin{bmatrix} 0.76 \\ 0.83 \end{bmatrix}. \]

So the summary vector is not a new primitive object. It is assembled from the directions already present.
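As a quick check of the arithmetic, here is the same computation as a short NumPy sketch (nothing beyond the numbers already used above):

```python
import numpy as np

v1 = np.array([1.0, 0.2])
v2 = np.array([0.4, 1.1])
v3 = np.array([1.2, 0.8])

# Weights from the walkthrough.
a = np.array([0.2, 0.5, 0.3])

z = a[0] * v1 + a[1] * v2 + a[2] * v3
print(z)  # approximately [0.76 0.83]
```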

In transformer attention, the same pattern appears in the output for one query:

\[ \operatorname{Attn}(q, K, V) = \sum_{i=1}^m \alpha_i v_i, \]

where the coefficients \(\alpha_i\) come from similarity scores between the query and the keys.

So the output lives in the span of the value vectors.
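Here is a minimal sketch of that output for a single query (NumPy; dot-product scores passed through a softmax, with the usual \(1/\sqrt{d}\) scaling omitted for brevity; the helper name `attention_output` and the toy numbers are just for this sketch):

```python
import numpy as np

def attention_output(q, K, V):
    """Single-query attention written explicitly as a linear combination.

    q: (d,) query, K: (m, d) keys, V: (m, d_v) values.
    """
    scores = K @ q                        # similarity between the query and each key
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()           # normalized weights, they sum to 1
    # The output is sum_i alpha_i * v_i, so it lies in the span of the rows of V.
    return sum(a_i * v_i for a_i, v_i in zip(alpha, V))

# Toy example: reuse the vectors from the walkthrough as both keys and values.
q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.2], [0.4, 1.1], [1.2, 0.8]])
V = K
print(attention_output(q, K, V))
```

Stacking many queries turns this per-query sum into the matrix form discussed in the next section.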

6 Implementation or Computation Note

In code, these mixtures are often written as matrix multiplication:

\[ z = V^\top \alpha \qquad \text{or} \qquad Z = AV, \]

depending on whether a single weight vector \(\alpha\) or a full matrix of weights \(A\) is applied, and on how the value vectors are laid out in \(V\).

That matrix form hides the same basic story:

  • the columns or rows of \(V\) are the available directions
  • the weight vector or matrix chooses how much of each direction to use

For interpretation, it often helps to unpack the matrix product back into a linear-combination formula.
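As a small illustration (NumPy, with the value vectors stored as rows of \(V\)), the matrix form and the unpacked linear combination give the same vector:

```python
import numpy as np

# Value vectors stored as rows of V, one weight vector alpha.
V = np.array([[1.0, 0.2],
              [0.4, 1.1],
              [1.2, 0.8]])
alpha = np.array([0.2, 0.5, 0.3])

# Matrix form: z = V^T alpha.
z_matrix = V.T @ alpha

# Unpacked form: the same thing written as an explicit linear combination.
z_sum = sum(a_i * v_i for a_i, v_i in zip(alpha, V))

print(np.allclose(z_matrix, z_sum))  # True
```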

7 Failure Modes

  • even when the algebra is simple, the weights themselves can be hard to interpret, which makes the resulting representation hard to read
  • if the source vectors are redundant, many different coefficient choices can produce the same output (see the sketch after this list)
  • if the value vectors miss an important direction, no weighting scheme can create it from nothing
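A small sketch of the redundancy point (NumPy, reusing \(v_1\) and \(v_2\) from above and setting \(v_3 = v_1 + v_2\)): two different coefficient vectors produce exactly the same output.

```python
import numpy as np

v1 = np.array([1.0, 0.2])
v2 = np.array([0.4, 1.1])
v3 = v1 + v2                     # a redundant direction: v3 lies in span{v1, v2}

V = np.stack([v1, v2, v3])

a = np.array([0.2, 0.5, 0.3])
b = np.array([0.5, 0.8, 0.0])    # shift the weight on v3 onto v1 and v2

print(np.allclose(V.T @ a, V.T @ b))  # True: different coefficients, same output
```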

8 Paper Bridge

9 Try It

  1. Change the weights so one coefficient is negative. The output is still a linear combination, but it is no longer an average-like mixture.
  2. Replace \(v_3\) by \(v_1+v_2\). Which outputs become representable in multiple ways?
  3. Read the attention equation in a paper and rewrite it explicitly as a linear combination.

10 Sources and Further Reading
