In-Context Learning and Linearization

A bridge page showing how prompt examples act like a tiny dataset in context, and why simplified analyses often interpret in-context learning through linear regression or local linear updates.
Modified: April 26, 2026

Keywords

in-context learning, few-shot learning, prompting, linearization, transformers

1 Application Snapshot

Large language models can often solve a new task from examples written in the prompt, without any gradient update to the weights.

That behavior is called in-context learning (ICL).

At first it can feel mysterious. But one useful mathematical lens is much less mystical:

the prompt can behave like a tiny training set, and the model can behave as if it is fitting a simple predictor inside its activations

In the cleanest synthetic settings, that implicit predictor often looks surprisingly close to linear regression, ridge regression, or another simple update rule.

2 Problem Setting

Suppose a prompt contains demonstration pairs

\[ (x_1, y_1), \dots, (x_k, y_k) \]

followed by a query input \(x_\star\).

The model must produce an output \(y_\star\).

The important constraint is:

the model adapts to the task through the context, not by changing parameters

So the context window plays two roles at once:

  • it is input to the model
  • it also acts like a tiny task-specific dataset

This is different from standard supervised training, where the model absorbs data into its weights across optimization steps.
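
To make the dual role concrete, here is a minimal Python sketch. The prompt template and the pair values are made up for illustration; nothing here is tied to a specific model or API.

```python
# Illustrative sketch: the same few-shot prompt viewed two ways.
# The template and values below are made up for illustration.

demos = [(1, 2), (3, 6)]   # demonstration pairs (x_i, y_i)
x_query = 4                # query input x_star

# Role 1: the prompt is plain input text for the model.
prompt = "\n".join(f"Input: {x} -> Output: {y}" for x, y in demos)
prompt += f"\nInput: {x_query} -> Output:"

# Role 2: the same content read as a tiny task-specific dataset.
X = [x for x, _ in demos]
y = [y_i for _, y_i in demos]

print(prompt)
print("X =", X, "| y =", y, "| query =", x_query)
```

The point of the sketch is only that the two readings share the same underlying pairs: the text format and the dataset are different views of one context.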

3 Why This Math Appears

This page sits on top of several earlier bridges, and that is why linear algebra keeps reappearing in ICL papers:

  • demonstrations become vectors in context
  • attention computes structured comparisons
  • predictions can sometimes be approximated by linear maps or least-squares style estimators

4 Math Objects In Use

  • prompt examples \((x_i, y_i)\)
  • query example \(x_\star\)
  • hidden representation or embedding space
  • attention weights over contextual examples
  • implicit predictor induced by the prompt
  • linear estimator or local linearization

5 A Small Worked Walkthrough

Take a toy regression task where the hidden rule is

\[ y = 2x. \]

Suppose the prompt contains two demonstrations

\[ (1,2), \qquad (3,6), \]

and the query is

\[ x_\star = 4. \]

The correct answer is clearly \(8\).

Now write the demonstrations as a tiny linear-regression problem:

\[ X = \begin{bmatrix} 1 \\ 3 \end{bmatrix}, \qquad y = \begin{bmatrix} 2 \\ 6 \end{bmatrix}. \]

The least-squares fit for a one-dimensional slope is

\[ \hat{w} = (X^\top X)^{-1} X^\top y. \]

Here,

\[ X^\top X = 1^2 + 3^2 = 10, \qquad X^\top y = 1\cdot 2 + 3\cdot 6 = 20, \]

so

\[ \hat{w} = \frac{20}{10} = 2. \]

Then the prediction for the query is

\[ \hat{y}_\star = x_\star \hat{w} = 4 \cdot 2 = 8. \]
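
As a sanity check, a few lines of NumPy reproduce this arithmetic. This is a sketch of the computation above, not of anything a language model does internally.

```python
import numpy as np

# Check the worked example numerically.
X = np.array([[1.0], [3.0]])   # demonstration inputs, one column
y = np.array([2.0, 6.0])       # demonstration targets
x_star = 4.0                   # query input

# Least-squares slope: w_hat = (X^T X)^{-1} X^T y,
# computed with a stable solver rather than an explicit inverse.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat_star = x_star * w_hat[0]

print(w_hat[0], y_hat_star)    # ~2.0  ~8.0
```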

This does not mean a language model literally writes down and inverts a matrix every time it sees a few-shot prompt.

It means something more modest and more useful:

  • the prompt can encode a small supervised-learning problem
  • attention can route information from the demonstrations to the query
  • in simplified settings, the resulting behavior can match a linear predictor very closely

That is the linearization lens: analyze the in-context behavior by comparing it to a simple, mathematically transparent estimator.
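
One toy way to see what "attention routes information" could mean, and how it differs from an implicit linear fit, is to implement attention as a kernel smoother over the demonstrations. The function and its squared-distance scoring rule are illustrative choices, not a claim about any real transformer's internals.

```python
import numpy as np

def attention_readout(x_star, xs, ys, temp=1.0):
    """Toy illustration: predict y_star as an attention-weighted
    average of demonstration labels (a kernel smoother).
    Offered only as an analogy for attention routing label
    information to the query."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    # Similarity scores between the query and each demonstration input.
    scores = -((xs - x_star) ** 2) / temp
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over demonstrations
    return weights @ ys               # weighted average of labels

print(attention_readout(4.0, [1, 3], [2, 6]))  # ~6.0, not 8
```

On the \(y = 2x\) task this smoother predicts roughly 6, the label of the nearest demonstration, rather than the correct 8: averaging over stored labels interpolates, while a least-squares-style estimator extrapolates. That gap is one concrete way to separate the stories discussed in the next section.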

6 Implementation or Computation Note

There are at least three different stories people tell about ICL:

  1. Pattern continuation: the model continues token patterns it has seen many times in pretraining.

  2. Retrieval: the model uses attention to pull relevant examples or templates from the prompt.

  3. Implicit learning algorithm: the model behaves as if it is fitting a small predictor inside its activations from the prompt examples.

In practice, all three can matter, and which one dominates depends on:

  • model scale
  • task type
  • prompt format
  • how close the task is to the pretraining distribution

This is also why the term linearization should be read carefully.

On this page, it means:

  • using a linear or locally linear mathematical model to explain part of the behavior
  • not claiming that all in-context learning in general LLMs is fully solved by linear regression

The cleanest evidence for the linearization view comes from controlled synthetic tasks such as linear regression, where researchers can compare transformer predictions directly with least squares, ridge, or gradient-based learners.
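
Here is a minimal sketch of the baseline side of such a comparison. The task sizes, ridge penalty, and step size are arbitrary choices, and the transformer itself is omitted: the sketch only computes the reference predictions a transformer's output would be compared against.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic in-context regression task: k demonstrations from y = w^T x.
d, k = 5, 20
w_true = rng.normal(size=d)
X = rng.normal(size=(k, d))
y = X @ w_true
x_star = rng.normal(size=d)

# Three reference estimators of the kind papers compare transformers against.
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]                    # least squares
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # ridge
eta, w_gd = 0.01, np.zeros(d)
for _ in range(10):                                            # a few GD steps
    w_gd -= eta * X.T @ (X @ w_gd - y) / k

for name, w in [("least squares", w_ls), ("ridge", w_ridge), ("grad descent", w_gd)]:
    print(f"{name:>14}: prediction = {x_star @ w:.3f}")
print(f"{'true':>14}: prediction = {x_star @ w_true:.3f}")
```

In the actual experiments, each baseline prediction would be set against the transformer's own output for the same in-context prompt.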

7 Failure Modes

  • assuming in-context learning is just memorization with a fancy name
  • assuming every case of ICL is literally gradient descent in hidden space
  • overgeneralizing from synthetic linear-regression tasks to all language-model reasoning
  • forgetting that prompt order, formatting, and tokenization can materially change behavior
  • treating attention weights as a full explanation rather than one partial view of the mechanism
  • confusing in-context learning with fine-tuning or weight updates
