Learned Linear Projections in Transformers
application, matrices, linear maps, transformers, projections
1 Application Snapshot
Transformers are full of nonlinearities and normalization layers, but the linear algebra core keeps showing up in one very clean form:
\[ x \mapsto Wx. \]
That is a learned linear map.
2 Problem Setting
Suppose an input token representation is a vector
\[ x \in \mathbb{R}^d. \]
To prepare it for attention, the model applies learned matrices:
\[ q = W_Q x, \qquad k = W_K x, \qquad v = W_V x. \]
These are three different linear maps from the same input space into three learned feature spaces.
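As a concrete sketch, the three projections take only a few lines of NumPy. The dimensions d and d_k and the random matrices below are hypothetical stand-ins for learned weights, not values from any particular model.

```python
import numpy as np

# Hypothetical sizes: model dimension d, projection dimension d_k.
d, d_k = 4, 3
rng = np.random.default_rng(0)

# In a trained model these matrices are learned; here they are random stand-ins.
W_Q = rng.standard_normal((d_k, d))
W_K = rng.standard_normal((d_k, d))
W_V = rng.standard_normal((d_k, d))

x = rng.standard_normal(d)  # one token representation

# Three different linear maps applied to the same input vector.
q = W_Q @ x
k = W_K @ x
v = W_V @ x

print(q.shape, k.shape, v.shape)  # (3,) (3,) (3,)
```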
3 Why This Math Appears
The operator viewpoint tells you what the model is doing:
- the input vector is not merely stored, it is transformed
- each matrix chooses a different learned coordinate system
- composing layers means composing maps, and composing linear maps is matrix multiplication, at least until the nonlinearities are inserted
So the linear-algebra object is not only the matrix itself, but the rule it represents.
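A minimal numerical check of the composition point, assuming NumPy and random stand-in matrices W1 and W2 for the linear parts of two layers:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W1 = rng.standard_normal((d, d))  # first layer's linear part (random stand-in)
W2 = rng.standard_normal((d, d))  # second layer's linear part (random stand-in)
x = rng.standard_normal(d)

two_step = W2 @ (W1 @ x)   # apply the maps one after the other
combined = (W2 @ W1) @ x   # apply the single composed map once

print(np.allclose(two_step, combined))  # True: composition of maps is matrix multiplication
```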
4 Math Objects In Use
- input vector \(x\)
- projection matrices \(W_Q, W_K, W_V\)
- linear maps between feature spaces
- matrix multiplication as coordinate computation
5 Worked Walkthrough
Take a toy input vector
\[ x = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \]
and projection matrices
\[ W_Q = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}, \qquad W_K = \begin{bmatrix} 2 & -1 \\ 0 & 1 \end{bmatrix}, \qquad W_V = \begin{bmatrix} 1 & 1 \\ 0 & 2 \end{bmatrix}. \]
Then
\[ q = W_Q x = \begin{bmatrix} 1 \\ 3 \end{bmatrix}, \qquad k = W_K x = \begin{bmatrix} 0 \\ 2 \end{bmatrix}, \qquad v = W_V x = \begin{bmatrix} 3 \\ 4 \end{bmatrix}. \]
So one vector has been sent through three different linear maps, each designed for a different role in the attention block.
The point is not the toy numbers. It is that the matrices encode different operator behaviors.
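For readers who want to reproduce the toy numbers, here is a short NumPy check of the walkthrough above:

```python
import numpy as np

x = np.array([1, 2])
W_Q = np.array([[1, 0], [1, 1]])
W_K = np.array([[2, -1], [0, 1]])
W_V = np.array([[1, 1], [0, 2]])

# Reproduce the walkthrough: q = W_Q x, k = W_K x, v = W_V x.
print(W_Q @ x)  # [1 3]
print(W_K @ x)  # [0 2]
print(W_V @ x)  # [3 4]
```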
6 Implementation or Computation Note
Real code often batches many token vectors at once. Then the same math becomes
\[ Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V, \]
or the transposed equivalent, depending on the convention.
The batch form can make the operator viewpoint harder to see, so it is helpful to remember that each row or column is still being sent through the same learned linear map.
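Here is a sketch of the batched form under the row-vector convention (tokens as rows of X, so the projection multiplies on the right); the sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, d, d_k = 5, 4, 3

X = rng.standard_normal((n_tokens, d))  # one row per token
W_Q = rng.standard_normal((d, d_k))     # learned projection (random stand-in)

Q = X @ W_Q  # batched form: Q = X W_Q

# Row i of Q is the same linear map applied to row i of X.
print(np.allclose(Q[0], X[0] @ W_Q))  # True
```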
7 Failure Modes
- looking only at matrix entries instead of asking what the map does to directions
- forgetting that the learned map depends on the chosen coordinate system of the feature space
- over-interpreting one projection matrix without considering the rest of the block and the following nonlinear pieces
8 Paper Bridge
- Attention Is All You Need - Paper bridge: the canonical source where learned linear projections into query, key, and value spaces are explicit.
- Deep learning, transformers and graph neural networks: a linear algebra perspective - Second pass: a current survey framing these learned projections as matrix and operator objects inside larger architectures.
9 Try It
- Replace one of the matrices by the identity and describe what role is removed.
- Compose two learned matrices and compare the single combined map with the two-step description.
- Read one transformer diagram and rewrite each projection box as a linear map.
10 Sources and Further Reading
- Attention Is All You Need - First pass: the cleanest architecture paper for seeing learned linear projections in action. Checked 2026-04-24.
- Introduction to Applied Linear Algebra – Vectors, Matrices, and Least Squares - Second pass: a good computational source for the operator viewpoint behind matrix multiplication. Checked 2026-04-24.
- Deep learning, transformers and graph neural networks: a linear algebra perspective - Paper bridge: a current bridge from basic linear maps to modern model architectures. Checked 2026-04-24.