Linear Probes and Representation Diagnostics

A bridge page showing how linear probes test what a frozen representation makes linearly decodable, and how to read probe results without overclaiming what a model knows.
Modified: April 26, 2026

Keywords: linear probe, representation diagnostics, transfer learning, frozen features, decodability

1 Application Snapshot

A linear probe asks a narrow but useful question:

if we freeze the representation, what information is already easy to decode with a linear head?

That makes linear probing one of the simplest diagnostics for representation quality.

It is especially useful when you want to compare:

  • different layers of the same model
  • different pretraining methods
  • zero-shot behavior versus learned downstream adaptation
  • frozen-feature transfer versus full fine-tuning

2 Problem Setting

Suppose a pretrained model maps an input \(x\) to a hidden representation

\[ z(x) \in \mathbb{R}^d. \]

We freeze that representation and train only a linear predictor

\[ \hat{y} = W z(x) + b \]

or, for multiclass classification, a softmax applied to those logits.

The encoder is not updated. Only \(W\) and \(b\) are trained on the downstream labels.

So a probe does not ask:

can the whole model solve the task after end-to-end adaptation?

It asks:

does this frozen representation already make the task linearly accessible?
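
In code, the whole setup is small. Here is a minimal sketch, assuming the frozen features have already been extracted into arrays; the random arrays below are placeholders standing in for real encoder outputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder frozen features: in practice these are encoder outputs
# z(x), extracted once and never updated.
rng = np.random.default_rng(0)
Z_train = rng.normal(size=(200, 64))
y_train = (Z_train[:, 0] > 0).astype(int)   # placeholder labels
Z_val = rng.normal(size=(100, 64))
y_val = (Z_val[:, 0] > 0).astype(int)

# The probe: a linear classifier whose only trained parameters are W and b.
probe = LogisticRegression(max_iter=1000)
probe.fit(Z_train, y_train)
print("val accuracy:", probe.score(Z_val, y_val))
```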

3 Why This Math Appears

This page sits on top of several earlier bridges: linear maps and decision rules, train / validation splits, and downstream metrics such as accuracy and cross-entropy.

So linear probing is where representation geometry meets evaluation discipline.

4 Math Objects In Use

  • frozen representation \(z(x)\)
  • linear head \(W z + b\)
  • training and validation splits
  • accuracy, cross-entropy, or another downstream metric
  • layer index if we probe multiple hidden states
  • control baselines that help separate real signal from probe memorization

5 A Small Worked Walkthrough

Suppose a frozen encoder maps four inputs into two-dimensional vectors:

\[ z(x_1) = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \qquad z(x_2) = \begin{bmatrix} 1.5 \\ 0.5 \end{bmatrix}, \qquad z(x_3) = \begin{bmatrix} -1 \\ -1 \end{bmatrix}, \qquad z(x_4) = \begin{bmatrix} -1.5 \\ -0.5 \end{bmatrix}. \]

Assume \(x_1,x_2\) belong to class A and \(x_3,x_4\) belong to class B.

A linear probe with

\[ w = \begin{bmatrix} 1 \\ 0.5 \end{bmatrix}, \qquad b = 0 \]

computes the score \(s(x)=w^\top z(x)\).

Then

\[ s(x_1)=2.5,\quad s(x_2)=1.75,\quad s(x_3)=-1.5,\quad s(x_4)=-1.75. \]

So a single linear separator already splits the classes.
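
The arithmetic is small enough to verify directly. A quick NumPy check of the same scores:

```python
import numpy as np

# Frozen representations from the walkthrough, one row per input.
Z = np.array([[ 2.0,  1.0],
              [ 1.5,  0.5],
              [-1.0, -1.0],
              [-1.5, -0.5]])
w = np.array([1.0, 0.5])
b = 0.0

scores = Z @ w + b
print(scores)  # [ 2.5   1.75 -1.5  -1.75]  -> sign splits class A from B
```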

The important conclusion is not that the model “understands class A.” The narrower conclusion is:

  • the frozen representation places the two classes in a geometry that a linear rule can separate
  • a downstream task may therefore require only a small head rather than a full model rewrite

Now imagine probing two layers of the same network:

  • an early layer gives probe accuracy near random
  • a later layer gives strong validation accuracy

That suggests the later layer makes task-relevant information more linearly accessible. It still does not prove that the model itself applies that linear rule anywhere in its own forward computation.

6 Implementation or Computation Note

A practical probe workflow usually looks like this (a minimal sketch follows the list):

  1. choose one or more frozen layers
  2. extract representations on a clean train / validation / test split
  3. train only a linear head
  4. compare validation and test behavior across layers or models
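
A sketch of steps 2 through 4, assuming the per-layer features have already been dumped to arrays; the random arrays and layer indices below are placeholders, not a real model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder for step 2: in practice each entry holds the encoder
# activations z(x) extracted at that layer for the train and val splits.
features_by_layer = {
    layer: (rng.normal(size=(200, 32)), rng.normal(size=(100, 32)))
    for layer in (0, 4, 8)
}
y_train = rng.integers(0, 2, size=200)
y_val = rng.integers(0, 2, size=100)

# Steps 3 and 4: train only a linear head per layer, compare on val.
for layer, (Z_train, Z_val) in features_by_layer.items():
    head = LogisticRegression(max_iter=1000)  # W and b only
    head.fit(Z_train, y_train)
    print(f"layer {layer}: val accuracy {head.score(Z_val, y_val):.3f}")
```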

Useful diagnostics include:

  • layerwise probe accuracy
  • train versus validation gap
  • zero-shot versus linear-probe versus fine-tuned performance
  • probe performance under small data budgets
  • control tasks or random-label baselines

This is why probing is not just “train a tiny classifier.” It is an evaluation design problem.
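
The random-label baseline from the list above is also easy to sketch. The idea: if the probe can fit shuffled labels on the training set, it has enough capacity to memorize at this data budget, which weakens conclusions drawn from its accuracy on the real labels. Again, the arrays are placeholders for frozen features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
Z_train = rng.normal(size=(200, 32))     # placeholder frozen features
y_train = rng.integers(0, 2, size=200)   # placeholder labels

# Probe on the real labels.
real = LogisticRegression(max_iter=1000).fit(Z_train, y_train)

# Control: the same probe on shuffled labels. High training accuracy
# here means the probe can memorize, not decode.
y_shuffled = rng.permutation(y_train)
control = LogisticRegression(max_iter=1000).fit(Z_train, y_shuffled)

print(f"train acc, real labels:     {real.score(Z_train, y_train):.3f}")
print(f"train acc, shuffled labels: {control.score(Z_train, y_shuffled):.3f}")
```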

For modern foundation models, linear probes are often used because they are cheap, reproducible, and less confounded than full fine-tuning. They also let you ask whether a representation is already useful before spending compute on larger adaptation.

7 Failure Modes

  • treating probe accuracy as proof that the model causally uses that feature
  • using a probe that is too expressive, so the probe learns the task instead of revealing the representation (see the sketch after this list)
  • comparing probes across models with mismatched data budgets or preprocessing
  • ignoring train / validation / test leakage
  • reading tiny accuracy differences as strong structural conclusions
  • forgetting that high probe accuracy can still coexist with poor robustness or poor calibration
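
The "too expressive" failure mode is easy to demonstrate: swap the linear head for a small MLP on the same frozen features and watch accuracy rise even though the representation has not changed. A sketch with synthetic features where the label is nonlinearly encoded:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 16))              # synthetic frozen features
y = (np.sin(3 * Z[:, 0]) > 0).astype(int)   # nonlinearly encoded label
Z_train, y_train = Z[:200], y[:200]
Z_val, y_val = Z[200:], y[200:]

linear = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000,
                    random_state=0).fit(Z_train, y_train)

# The MLP can decode nonlinear structure the linear head misses, so its
# higher score reflects probe capacity, not better linear accessibility.
print(f"linear probe val acc: {linear.score(Z_val, y_val):.3f}")
print(f"MLP probe val acc:    {mlp.score(Z_val, y_val):.3f}")
```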

One especially important caution is this:

linear decodability is evidence about accessible information, not a full theory of representation meaning.

