Vector Mixtures in Embeddings and Attention

A concrete application page showing how weighted vector combinations become embeddings, pooled representations, and attention outputs.
Modified

April 26, 2026

Keywords

application, vectors, embeddings, attention, weighted sum

1 Application Snapshot

One of the most common operations in modern ML is not mysterious at all:

take several vectors, weight them, and add them.

That is exactly a linear combination.

This viewpoint lets you recognize the same math inside:

  • pooled embeddings
  • mixture models of features
  • attention outputs in transformers

2 Problem Setting

Suppose we have three token or feature vectors in \(\mathbb{R}^2\):

\[ v_1 = \begin{bmatrix} 1.0 \\ 0.2 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} 0.4 \\ 1.1 \end{bmatrix}, \qquad v_3 = \begin{bmatrix} 1.2 \\ 0.8 \end{bmatrix}. \]

We want one summary representation built from these directions.

3 Why This Math Appears

If the weights are \(a_1,a_2,a_3\), then the representation

\[ z = a_1 v_1 + a_2 v_2 + a_3 v_3 \]

is a linear combination of the available vectors.

Different pipelines choose the weights differently:

  • in average pooling, they are fixed in advance
  • in learned mixtures, they are produced by another model component
  • in attention, they depend on similarity scores and are normalized before the sum is taken

The math object stays the same even when the modeling story changes.
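To make the fixed-weight case concrete, here is a minimal sketch (NumPy, reusing the vectors from the problem setting above): average pooling is this same weighted sum with every weight fixed in advance at \(1/m\).

```python
import numpy as np

# The three feature vectors from the problem setting.
v1 = np.array([1.0, 0.2])
v2 = np.array([0.4, 1.1])
v3 = np.array([1.2, 0.8])

# Average pooling: the weights are fixed in advance at 1/m.
a = np.array([1/3, 1/3, 1/3])
z_pooled = a[0] * v1 + a[1] * v2 + a[2] * v3
print(z_pooled)  # the same linear combination, just with uniform weights
```

Learned mixtures and attention only change where the weights \(a_i\) come from, not this sum.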

4 Math Objects In Use

  • embedding vectors \(v_i\)
  • scalar weights \(a_i\)
  • weighted sum \(z = \sum_i a_i v_i\)
  • span of the available representation directions

5 Worked Walkthrough

Take weights

\[ a_1 = 0.2, \qquad a_2 = 0.5, \qquad a_3 = 0.3. \]

Then

\[ z = 0.2v_1 + 0.5v_2 + 0.3v_3 \]

becomes

\[ z = 0.2 \begin{bmatrix} 1.0 \\ 0.2 \end{bmatrix} + 0.5 \begin{bmatrix} 0.4 \\ 1.1 \end{bmatrix} + 0.3 \begin{bmatrix} 1.2 \\ 0.8 \end{bmatrix} = \begin{bmatrix} 0.76 \\ 0.83 \end{bmatrix}. \]

So the summary vector is not a new primitive object. It is assembled from the directions already present.
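As a quick check of the arithmetic, here is the same computation as a short NumPy sketch (nothing beyond the numbers already used above):

```python
import numpy as np

v1 = np.array([1.0, 0.2])
v2 = np.array([0.4, 1.1])
v3 = np.array([1.2, 0.8])

# Weights from the walkthrough.
a = np.array([0.2, 0.5, 0.3])

z = a[0] * v1 + a[1] * v2 + a[2] * v3
print(z)  # approximately [0.76 0.83]
```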

In transformer attention, the same pattern appears in the output for one query:

\[ \operatorname{Attn}(q, K, V) = \sum_{i=1}^m \alpha_i v_i, \]

where the coefficients \(\alpha_i\) come from similarity scores between the query and the keys.

So the output lives in the span of the value vectors.
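Here is a minimal sketch of that output for a single query (NumPy; dot-product scores passed through a softmax, with the usual \(1/\sqrt{d}\) scaling omitted for brevity; the helper name `attention_output` and the toy numbers are just for this sketch):

```python
import numpy as np

def attention_output(q, K, V):
    """Single-query attention written explicitly as a linear combination.

    q: (d,) query, K: (m, d) keys, V: (m, d_v) values.
    """
    scores = K @ q                        # similarity between the query and each key
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()           # normalized weights, they sum to 1
    # The output is sum_i alpha_i * v_i, so it lies in the span of the rows of V.
    return sum(a_i * v_i for a_i, v_i in zip(alpha, V))

# Toy example: reuse the vectors from the walkthrough as both keys and values.
q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.2], [0.4, 1.1], [1.2, 0.8]])
V = K
print(attention_output(q, K, V))
```

Stacking many queries turns this per-query sum into the matrix form discussed in the next section.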

6 Implementation or Computation Note

In code, these mixtures are often written as matrix multiplication:

\[ z = V^\top \alpha \qquad \text{or} \qquad Z = AV, \]

depending on whether a single weight vector \(\alpha\) or a full matrix of weights \(A\) is applied, and on how the value vectors are laid out in \(V\).

That matrix form hides the same basic story:

  • the columns or rows of \(V\) are the available directions
  • the weight vector or matrix chooses how much of each direction to use

For interpretation, it often helps to unpack the matrix product back into a linear-combination formula.
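As a small illustration (NumPy, with the value vectors stored as rows of \(V\)), the matrix form and the unpacked linear combination give the same vector:

```python
import numpy as np

# Value vectors stored as rows of V, one weight vector alpha.
V = np.array([[1.0, 0.2],
              [0.4, 1.1],
              [1.2, 0.8]])
alpha = np.array([0.2, 0.5, 0.3])

# Matrix form: z = V^T alpha.
z_matrix = V.T @ alpha

# Unpacked form: the same thing written as an explicit linear combination.
z_sum = sum(a_i * v_i for a_i, v_i in zip(alpha, V))

print(np.allclose(z_matrix, z_sum))  # True
```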

7 Failure Modes

  • even when the algebra is simple, the weights themselves can be hard to interpret, which makes the resulting representation hard to read
  • if the source vectors are redundant, many different coefficient choices can produce the same output (see the sketch after this list)
  • if the value vectors miss an important direction, no weighting scheme can create it from nothing
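A small sketch of the redundancy point (NumPy, reusing \(v_1\) and \(v_2\) from above and setting \(v_3 = v_1 + v_2\)): two different coefficient vectors produce exactly the same output.

```python
import numpy as np

v1 = np.array([1.0, 0.2])
v2 = np.array([0.4, 1.1])
v3 = v1 + v2                     # a redundant direction: v3 lies in span{v1, v2}

V = np.stack([v1, v2, v3])

a = np.array([0.2, 0.5, 0.3])
b = np.array([0.5, 0.8, 0.0])    # shift the weight on v3 onto v1 and v2

print(np.allclose(V.T @ a, V.T @ b))  # True: different coefficients, same output
```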

8 Paper Bridge

9 Try It

  1. Change the weights so one coefficient is negative. The output is still a linear combination, but it is no longer an average-like mixture.
  2. Replace \(v_3\) by \(v_1+v_2\). Which outputs become representable in multiple ways?
  3. Read the attention equation in a paper and rewrite it explicitly as a linear combination.

10 Sources and Further Reading
