Representation Learning and Geometry of Embeddings
embeddings, representation learning, cosine similarity, geometry, vector spaces
1 Application Snapshot
Much of modern ML can be described as:
learn a vector representation in which useful similarities and directions become easy to use
That vector representation is an embedding.
Once an object has an embedding, many downstream operations reduce to geometry (see the sketch after this list):
- nearest neighbors
- cosine similarity
- clustering
- linear classification
- weighted mixtures and retrieval
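A minimal NumPy sketch of the first two operations. The random matrix `E` here is an assumed stand-in for learned embeddings, not output from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 64))  # toy stand-in: 1000 objects, 64-dim embeddings

def cosine_sim(a, b):
    # Cosine of the angle between vectors a and b.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest_neighbors(query, E, k=5):
    # Normalize rows so that dot products equal cosine similarities.
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = En @ q
    return np.argsort(-sims)[:k]  # indices of the k most similar objects

print(nearest_neighbors(E[0], E))  # the first hit is object 0 itself
```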
2 Problem Setting
Suppose a model maps each object \(x\) to a vector
\[ e(x) \in \mathbb{R}^d. \]
The goal is not just to compress the object into \(d\) numbers. It is to place related objects in a geometry that helps downstream tasks.
For example, we might want:
- similar words to have nearby vectors
- related images to form local neighborhoods
- classes to become easier to separate with a simple linear rule
So representation learning is often a search for a useful geometry, not only a search for a useful feature list.
3 Why This Math Appears
This page sits on top of several math pages already on the site:
- Vector Mixtures in Embeddings and Attention: embeddings are vectors that can be mixed, pooled, or queried
- Low-Dimensional Subspace Models: useful representations often concentrate signal into lower-dimensional structure
- Attention, Softmax, and Weighted Mixtures: attention reads and writes through embedding geometry
So the recurring math objects are not special to NLP or vision. They are:
- vector spaces
- angles and norms
- neighborhoods
- projections and subspaces
4 Math Objects In Use
- embedding map \(e(x)\)
- cosine similarity
  \[ \cos(e_i,e_j) = \frac{e_i^\top e_j}{\|e_i\|\,\|e_j\|} \]
- nearest-neighbor structure
- low-dimensional directions or subspaces (see the sketch after this list)
- linear probes or simple downstream classifiers
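As a sketch of the subspace item: one common way to extract dominant directions is an SVD of the centered embedding matrix. The random `E` below is again an assumed stand-in for learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
E = rng.normal(size=(500, 128))  # toy stand-in: 500 objects, 128 dims

# Center, then take the top-k right singular vectors: the k directions
# along which the embeddings vary the most.
Ec = E - E.mean(axis=0)
U, S, Vt = np.linalg.svd(Ec, full_matrices=False)
k = 10
basis = Vt[:k]          # (k, 128): basis for a low-dimensional subspace
coords = Ec @ basis.T   # (500, k): coordinates of each object in that subspace

# Fraction of total variance captured by the top-k directions.
print((S[:k] ** 2).sum() / (S ** 2).sum())
```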
5 A Small Worked Walkthrough
Suppose three learned word embeddings are
\[ e(\text{cat}) = \begin{bmatrix} 1.0 \\ 0.9 \end{bmatrix}, \qquad e(\text{dog}) = \begin{bmatrix} 0.9 \\ 1.0 \end{bmatrix}, \qquad e(\text{car}) = \begin{bmatrix} -1.0 \\ 0.2 \end{bmatrix}. \]
The dot product between cat and dog is
\[ e(\text{cat})^\top e(\text{dog}) = 1.8, \]
while the dot product between cat and car is
\[ e(\text{cat})^\top e(\text{car}) = -0.82. \]
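A quick NumPy check of these dot products, plus the corresponding cosine similarities:

```python
import numpy as np

cat = np.array([1.0, 0.9])
dog = np.array([0.9, 1.0])
car = np.array([-1.0, 0.2])

print(cat @ dog)  # 1.8
print(cat @ car)  # -0.82 (up to floating-point rounding)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(cat, dog))  # ~0.994: nearly parallel
print(cos(cat, car))  # ~-0.60: pointing in a different direction
```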
So even before any downstream classifier, the geometry already suggests:
- cat and dog are near each other
- car lies in a different direction
That is the practical meaning of embedding geometry:
- related objects cluster
- irrelevant objects separate
- simple downstream rules become easier
In larger models, we rarely inspect vectors by hand. But the same geometric questions remain:
- what forms a neighborhood?
- which directions encode task-relevant variation?
- are the learned representations easy to separate or retrieve from?
6 Implementation or Computation Note
In practice, embedding quality is often checked indirectly through:
- nearest-neighbor queries
- retrieval quality
- clustering structure
- performance of a simple linear probe on top of frozen embeddings (sketched below)
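A minimal sketch of that last check, using scikit-learn. The synthetic `X` and `y` below are assumed stand-ins; in practice `X` holds frozen embeddings from the model under study and `y` holds task labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for frozen embeddings and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))
y = (X[:, 0] + 0.1 * rng.normal(size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The probe itself: a plain linear classifier on top of the frozen
# embeddings. High held-out accuracy suggests the classes are already
# close to linearly separable in the embedding geometry.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(probe.score(X_te, y_te))
```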
The next natural page after this one is Linear Probes and Representation Diagnostics, which turns that last bullet into a more careful evaluation workflow.
This is important because a bigger model does not automatically imply a better geometry. Embeddings can also become anisotropic, overly collapsed, or highly task-specific in ways that harm transfer.
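One rough diagnostic for the anisotropy concern is the average cosine similarity between embeddings of randomly paired objects; values far above zero suggest the vectors crowd into a narrow cone. A minimal sketch, assuming an embedding matrix `E` with one row per object:

```python
import numpy as np

def mean_pairwise_cosine(E, n_pairs=10_000, seed=0):
    """Average cosine similarity over random pairs of rows of E.

    Near 0: directions are spread out. Far above 0: the embeddings
    crowd into a narrow cone, i.e. they are anisotropic.
    """
    rng = np.random.default_rng(seed)
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    i = rng.integers(0, len(E), size=n_pairs)
    j = rng.integers(0, len(E), size=n_pairs)
    keep = i != j  # skip self-pairs, whose cosine is trivially 1
    return float(np.mean(np.sum(En[i][keep] * En[j][keep], axis=1)))
```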
7 Failure Modes
- treating embeddings as mere storage vectors instead of a learned geometry
- reading cosine similarity as semantic truth rather than model-dependent geometry
- assuming visually clustered embeddings are automatically useful downstream
- ignoring that representation quality depends on the task and training signal
- forgetting that a linear probe can reveal whether useful structure is already present
8 Paper Bridge
- Efficient Estimation of Word Representations in Vector Space - Paper bridge - classic word2vec paper showing how useful vector geometry can be learned from prediction tasks. Checked 2026-04-24.
- CS224N Lecture 2: Word Vectors, Word Senses, and Neural Classifiers - First pass - official Stanford slide deck connecting embedding geometry to actual NLP practice. Checked 2026-04-24.
9 Sources and Further Reading
- Stanford CS224N - First pass - official course hub for modern NLP and the role of learned representations. Checked 2026-04-24.
- CS224N Lecture Notes: Word Vectors I - First pass - official notes introducing the geometry and intuition of word vectors. Checked 2026-04-24.
- CS224N Lecture 2: Word Vectors, Word Senses, and Neural Classifiers - First pass - official Stanford slides connecting learned embeddings to linear classifiers and downstream use. Checked 2026-04-24.
- Word Embedding (word2vec) - Second pass - open text for seeing embeddings as learnable vector geometry. Checked 2026-04-24.